This project demonstrates how to build a simple data pipeline to retrieve the Dow Jones Industrial Average (^DJI) historical data from Yahoo Finance using the yfinance library. The data is processed and saved to a .parquet file using the polars library. The primary goal is to showcase the simplicity and efficiency of using Prefect v3.0 as an orchestration tool for developing data pipelines.
- Task 1: Fetch Data. Retrieve historical data for Dow Jones (^DJI) using yfinance and convert it to a Polars DataFrame.
- Task 2: Process Data. Perform basic transformations like filtering, renaming columns, and setting date formats using polars.
- Task 3: Save Data. Save the cleaned and processed data to a .parquet file.
- Prefect Orchestration. Each step of the pipeline is orchestrated using Prefect.
- Clone the repository:
    git clone https://github.com/NolanMM/Perfect_v3.0_Data_Engineering_Pipeline.git
    cd Perfect_v3.0_Data_Engineering_Pipeline
- Create and activate a virtual environment:
    python -m venv venv
    # On Linux:
    source venv/bin/activate
    # On Windows:
    venv\Scripts\activate
- Install the required dependencies:
    pip install -r requirements.txt
1. Activate the Virtual Environment
Open a separate terminal and activate the Python virtual environment:
    # On Linux:
    source venv/bin/activate
    # On Windows:
    venv\Scripts\activate
2. Start Prefect Server
In the same terminal, start the Prefect server to enable orchestration and monitoring:
    prefect server start
This will launch the Prefect server locally. By default:
- The server UI is accessible at http://127.0.0.1:4200/dashboard
- Keep this terminal running during the execution of the pipeline.
3. Run the Pipeline
Open a new terminal, navigate to the project directory, and activate the virtual environment:
    # On Linux:
    source venv/bin/activate
    # On Windows:
    venv\Scripts\activate
Then execute the pipeline:
    # On Windows (PowerShell):
    $env:PREFECT_API_URL="http://127.0.0.1:4200/api"; python Simple_Data_Pipeline.py
    # On Linux:
    PREFECT_API_URL="http://127.0.0.1:4200/api" python Simple_Data_Pipeline.py
4. Monitor Pipeline Execution
Visit the Prefect server UI in your browser (http://127.0.0.1:4200/dashboard) to monitor task execution, inspect logs, and troubleshoot any issues.
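Because the way `PREFECT_API_URL` is set depends on the shell, a small Python launcher can serve as a shell-independent alternative for steps 2 and 3. The sketch below uses only the standard library. The helper names are hypothetical, and the `/health` path is an assumption about the Prefect server API that is worth checking against your installed version.

```python
import os
import subprocess
import sys
import urllib.request

DEFAULT_API_URL = "http://127.0.0.1:4200/api"


def server_is_up(api_url: str = DEFAULT_API_URL, timeout: float = 2.0) -> bool:
    """Return True if GET <api_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{api_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, refused connections, and timeouts all derive from OSError
        return False


def pipeline_env(api_url: str = DEFAULT_API_URL) -> dict:
    """Copy the current environment and point Prefect at the given API URL."""
    return {**os.environ, "PREFECT_API_URL": api_url}


def run_pipeline(script: str = "Simple_Data_Pipeline.py",
                 api_url: str = DEFAULT_API_URL) -> int:
    """Run the pipeline script with the current interpreter; return its exit code."""
    if not server_is_up(api_url):
        print(f"No Prefect server at {api_url}; run 'prefect server start' first.")
        return 1
    return subprocess.run([sys.executable, script], env=pipeline_env(api_url)).returncode
```

Running `run_pipeline()` from the project directory, with the virtual environment active, behaves like the command in step 3 but fails early with a readable message if the server is not running.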
- Data Retrieval: Fetch historical data for ^DJI using yfinance.
- Data Processing: Perform basic data cleaning and processing using polars.
- Data Storage: Save the processed data to a .parquet file.
- Orchestration: Use Prefect v3.0 to orchestrate the data pipeline.
- Python 3.9+ (required by Prefect 3.x)
- Libraries:
  - yfinance
  - polars
  - prefect (v3.0 or higher)
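One way to check that the active environment satisfies the library requirements above is to query the installed package metadata. The sketch below uses only the standard library. The helper names are illustrative, and the major-version test is a deliberately simple parse that assumes the version string starts with an integer.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ("yfinance", "polars", "prefect")


def installed_versions(packages=REQUIRED) -> dict:
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def prefect_is_v3(found: dict) -> bool:
    """True when the reported prefect version has a major number of 3 or higher."""
    prefect_version = found.get("prefect")
    return prefect_version is not None and int(prefect_version.split(".")[0]) >= 3
```

For example, `prefect_is_v3(installed_versions())` returns `False` if prefect is missing or still on a 2.x release, which is a cue to rerun `pip install -r requirements.txt`.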


