This project is a real-time data pipeline that ingests, processes, and visualizes carbon intensity data from the UK National Energy System Operator (NESO) API.
It demonstrates a full "Medallion Architecture" (Bronze → Silver → Gold) using Apache Spark Structured Streaming, Delta Lake, and Kafka, culminating in a live Streamlit dashboard.
The pipeline follows a strictly linear flow designed for robustness and data quality:
- Producer Service: A Python poller fetches live JSON data from the Carbon Intensity API every 60 seconds and pushes it to Kafka.
- Ingestion Layer (Bronze): A Spark Streaming job reads from Kafka and appends raw JSON events to a Bronze Delta Table. This layer preserves history and allows for replayability.
- Transformation Layer (Silver): A second Spark Streaming job reads from Bronze, performs deduplication across the API's rolling 30-minute windows, enforces the schema, and upserts clean data into a Silver Delta Table.
- Serving Layer (Dashboard): A Streamlit app polls the Silver table to provide a live, auto-refreshing interface for monitoring grid carbon intensity.
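
For orientation, the half-hourly record that flows through these layers looks roughly like the schema below. This is a sketch: the field names follow the public Carbon Intensity API, and the repo's jobs may parse the payload differently.

```python
# Sketch of the event schema as it might be parsed downstream.
# Field names mirror the public Carbon Intensity API; the actual
# schema used in this repository may differ.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

intensity_schema = StructType([
    StructField("from", StringType()),           # half-hour window start (ISO 8601)
    StructField("to", StringType()),             # half-hour window end
    StructField("intensity", StructType([
        StructField("forecast", IntegerType()),  # forecast gCO2/kWh
        StructField("actual", IntegerType()),    # actual gCO2/kWh (may lag or be null)
        StructField("index", StringType()),      # e.g. "low", "moderate", "high"
    ])),
])
```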
Before running the project, ensure you have the following installed:
- Python 3.10+
- Java 17 (OpenJDK): Required for Spark.
- Apache Kafka: You can run this locally or use a cloud provider.
- Hadoop Binaries (Windows only): ensure your `winutils.exe` and `hadoop.dll` are set up.
Then follow these steps to set up the project:

1. **Clone the Repository**

   ```bash
   git clone https://github.com/YourUsername/Streaming-Lakehouse-for-NESO-UK-Carbon-Intensity-Analytics.git
   cd Streaming-Lakehouse-for-NESO-UK-Carbon-Intensity-Analytics
   ```
2. **Set Up Virtual Environment**

   ```bash
   python -m venv .venv
   source .venv/bin/activate   # On Windows: .venv\Scripts\activate
   ```
3. **Install Dependencies**

   ```bash
   pip install -r spark/requirements.txt
   ```
Run each component in a separate terminal window to observe the full real-time flow.
This script polls the API and sends data to your Kafka topic `carbon.intensity.uk`:

```bash
python producer/poller.py
```
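
A minimal stand-in for such a poller is sketched below, using `requests` and `kafka-python`. The endpoint is the public Carbon Intensity API; the broker address and error handling are illustrative, not a description of the repo's actual `poller.py`.

```python
# Minimal polling loop: fetch the current half-hour intensity record
# and publish it to Kafka. A sketch only; poller.py may differ.
import json
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

API_URL = "https://api.carbonintensity.org.uk/intensity"
TOPIC = "carbon.intensity.uk"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get(API_URL, timeout=10)
    resp.raise_for_status()
    for record in resp.json()["data"]:   # one record per half-hour window
        producer.send(TOPIC, record)
    producer.flush()
    time.sleep(60)                       # poll every 60 seconds
```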
This Spark job consumes the Kafka stream and writes raw data to the Delta Lake Bronze table:

```bash
python spark/01_kafka_to_bronze.py
```

Wait for the "Batch processed" logs to appear.
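
A Kafka-to-Bronze job of this shape is typically a single append-only stream. The sketch below shows the general pattern; the table paths, checkpoint locations, and Spark configs are assumptions rather than the repo's actual values.

```python
# Sketch of a Kafka -> Bronze Delta append stream; paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("kafka_to_bronze")
    # The Delta Lake and Kafka connector packages must be on the classpath,
    # e.g. via spark.jars.packages or the repo's requirements setup.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "carbon.intensity.uk")
    .option("startingOffsets", "earliest")
    .load()
    # Keep the raw JSON plus Kafka metadata so the layer stays replayable.
    .selectExpr("CAST(value AS STRING) AS raw_json", "timestamp AS kafka_ts")
)

(
    raw.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "data/checkpoints/bronze")
    .start("data/bronze/carbon_intensity")
    .awaitTermination()
)
```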
This job processes the Bronze data, handling deduplication and merging updates into the Silver table:

```bash
python spark/02_bronze_to_silver.py
```
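
The usual pattern for this step is `foreachBatch` plus a Delta `MERGE`: deduplicate each micro-batch, then upsert. The sketch below assumes the Bronze layout from the previous sketch, a `window_from` merge key, and a pre-created Silver table; the repo's actual job may structure this differently.

```python
# Sketch of the Bronze -> Silver step: parse raw JSON, deduplicate each
# micro-batch, and MERGE into the Silver table. Paths, column names, and
# the merge key are assumptions; Delta session configs as in the Bronze
# sketch are assumed, and the Silver table is assumed to exist.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

EVENT_DDL = (
    "`from` STRING, `to` STRING, "
    "intensity STRUCT<forecast: INT, actual: INT, index: STRING>"
)

parsed = (
    spark.readStream.format("delta")
    .load("data/bronze/carbon_intensity")
    .select(F.from_json("raw_json", EVENT_DDL).alias("e"), "kafka_ts")
    .select(
        F.col("e.from").alias("window_from"),
        F.col("e.intensity.forecast").alias("forecast"),
        F.col("e.intensity.actual").alias("actual"),
        F.col("e.intensity.index").alias("intensity_index"),
        "kafka_ts",
    )
)

def upsert_batch(batch_df, batch_id):
    # Keep one row per half-hour window, preferring the newest event,
    # since a window can be re-published (e.g. once actuals arrive).
    latest = Window.partitionBy("window_from").orderBy(F.col("kafka_ts").desc())
    deduped = (
        batch_df.withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .drop("rn")
    )
    silver = DeltaTable.forPath(spark, "data/silver/carbon_intensity")
    (
        silver.alias("t")
        .merge(deduped.alias("s"), "t.window_from = s.window_from")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    parsed.writeStream.foreachBatch(upsert_batch)
    .option("checkpointLocation", "data/checkpoints/silver")
    .start()
    .awaitTermination()
)
```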
Finally, spin up the Streamlit interface to visualize the data live:

```bash
streamlit run dashboard/app.py
```

Once running, you can monitor the UK's carbon intensity in real time, including forecast vs. actual deltas and streaming metrics.
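
As a rough illustration of the serving side, a dashboard can poll the Silver table without a Spark session by using the `deltalake` (delta-rs) reader. The table path, column names, and refresh loop below are assumptions, not a description of the repo's actual `app.py`.

```python
# Sketch of a self-refreshing view over the Silver table using the
# `deltalake` reader -- no Spark session needed. Path and column names
# are assumptions; the repo's app.py may read the table differently.
import time

import streamlit as st
from deltalake import DeltaTable  # pip install deltalake

st.title("UK Carbon Intensity (Silver)")
placeholder = st.empty()

while True:
    df = (
        DeltaTable("data/silver/carbon_intensity")
        .to_pandas()
        .sort_values("window_from")
    )
    with placeholder.container():
        # 'actual' may be null for the newest window, so take the latest
        # row that has an observed value.
        latest = df.dropna(subset=["actual"]).iloc[-1]
        st.metric(
            "Latest actual (gCO2/kWh)",
            int(latest["actual"]),
            delta=int(latest["actual"]) - int(latest["forecast"]),
        )
        st.line_chart(df.set_index("window_from")[["forecast", "actual"]])
    time.sleep(60)  # re-read the table every minute
```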
The repository is laid out as follows:

- `producer/`: Contains the `poller.py` service.
- `spark/`: Contains the PySpark streaming jobs (`01_kafka_to_bronze.py`, `02_bronze_to_silver.py`).
- `dashboard/`: Contains the Streamlit `app.py`.
- `data/`: Local storage for Delta Tables and checkpoints (automatically generated).
This project is open-source and available under the MIT License.

