
This project demonstrates how to build a simple data pipeline to retrieve the Dow Jones Industrial Average (^DJI) historical data from Yahoo Finance using the yfinance library. The data is processed and saved to a .parquet file using the polars library.

NolanMM/Perfect_v3.0_Data_Engineering_Pipeline


Simple Data Pipeline

Retrieve and Process Dow Jones (DJI) Data Using Prefect v3.0


I. Overview

This project demonstrates how to build a simple data pipeline to retrieve the Dow Jones Industrial Average (^DJI) historical data from Yahoo Finance using the yfinance library. The data is processed and saved to a .parquet file using the polars library. The primary goal is to showcase the simplicity and efficiency of using Prefect v3.0 as an orchestration tool for developing data pipelines.


II. Pipeline Flow

  1. Task 1 (Fetch Data): Retrieve historical data for Dow Jones (^DJI) using yfinance and convert it to a Polars DataFrame.

  2. Task 2 (Process Data): Perform basic transformations such as filtering, renaming columns, and setting date formats using polars.

  3. Task 3 (Save Data): Save the cleaned and processed data to a .parquet file.

  4. Prefect Orchestration: Each step of the pipeline is orchestrated using Prefect.


III. Installation

  1. Clone the repository:

    git clone https://github.com/NolanMM/Perfect_v3.0_Data_Engineering_Pipeline.git
    cd Perfect_v3.0_Data_Engineering_Pipeline
  2. Create and activate a virtual environment:

    python -m venv venv
    
    # Linux
    source venv/bin/activate

    # Windows
    venv\Scripts\activate
    
  3. Install the required dependencies:

    pip install -r requirements.txt

IV. Usage

1. Activate the Virtual Environment: Open a separate terminal and activate the Python virtual environment:

# On Linux: source venv/bin/activate  
# On Windows: venv\Scripts\activate

2. Start Prefect Server: In the same terminal, start the Prefect server to enable orchestration and monitoring:

prefect server start

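This launches the Prefect server locally. By default, the API is available at http://127.0.0.1:4200/api and the dashboard at http://127.0.0.1:4200/dashboard.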

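3. Run the Pipeline: Open a new terminal, navigate to the project directory, and activate the virtual environment:

# On Linux: source venv/bin/activate  
# On Windows: venv\Scripts\activate

Then point the client at the local server and execute the pipeline:

# PowerShell (Windows):
$env:PREFECT_API_URL="http://127.0.0.1:4200/api"; python Simple_Data_Pipeline.py
# Bash (Linux):
PREFECT_API_URL="http://127.0.0.1:4200/api" python Simple_Data_Pipeline.py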

4. Monitor Pipeline Execution: Visit the Prefect server UI in your browser (http://127.0.0.1:4200/dashboard) to monitor task execution, inspect logs, and troubleshoot any issues.


V. Features

  1. Data Retrieval: Fetch historical data for ^DJI using yfinance.
  2. Data Processing: Perform basic data cleaning and processing using polars.
  3. Data Storage: Save the processed data to a .parquet file.
  4. Orchestration: Use Prefect v3.0 to orchestrate the data pipeline.

VI. Requirements

  • Python 3.9+ (Prefect 3.x does not support Python 3.8)
  • Libraries:
    • yfinance
    • polars
    • prefect (v3.0 or higher)
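
A quick way to verify these requirements in the active virtual environment is to query the installed package metadata. The helper below is a hypothetical convenience, not something shipped with the project, and it uses only the standard library.

```python
import sys
from importlib import metadata
from typing import Dict, Optional, Sequence

REQUIRED = ("yfinance", "polars", "prefect")


def check_environment(packages: Sequence[str] = REQUIRED) -> Dict[str, Optional[str]]:
    """Map 'python' and each package name to its version (None if missing)."""
    report: Dict[str, Optional[str]] = {
        "python": ".".join(str(part) for part in sys.version_info[:3])
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment
    return report
```

Printing `check_environment()` lists the interpreter version alongside the installed versions, so a missing library or a prefect release older than 3.0 is easy to spot.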
