GitHub - NolanMM/Perfect_v3.0_Data_Engineering_Pipeline: This project demonstrates how to build a simple data pipeline to retrieve the Dow Jones Industrial Average (^DJI) historical data from Yahoo Finance using the yfinance library. The data is processed and saved to a .parquet file using the polars library

Simple Data Pipeline

Retrieve and Process Dow Jones (DJI) Data Using Prefect v3.0

I. Overview

This project demonstrates how to build a simple data pipeline to retrieve the Dow Jones Industrial Average (^DJI) historical data from Yahoo Finance using the yfinance library. The data is processed and saved to a .parquet file using the polars library. The primary goal is to showcase the simplicity and efficiency of using Prefect v3.0 as an orchestration tool for developing data pipelines.

II. Pipeline Flow

Task 1 (Fetch Data): Retrieve historical data for Dow Jones (^DJI) using yfinance and convert it to a Polars DataFrame.
Task 2 (Process Data): Perform basic transformations such as filtering, renaming columns, and setting date formats using polars.
Task 3 (Save Data): Save the cleaned and processed data to a .parquet file.
Prefect Orchestration: Each step of the pipeline is orchestrated using Prefect.
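
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in Simple_Data_Pipeline.py: the function names, the `output_path` layout, the one-month period, the chosen columns, and the retry settings are all assumptions made for the example. Third-party imports are deferred into the functions so the module can be loaded before the dependencies are installed.

```python
from pathlib import Path

TICKER = "^DJI"


def output_path(ticker: str, root: str = "data_storage") -> Path:
    """Hypothetical layout: <root>/<ticker>/history.parquet."""
    return Path(root) / ticker / "history.parquet"


def fetch_data(ticker: str, period: str = "1mo"):
    """Task 1: download daily history and convert it to a Polars DataFrame."""
    import polars as pl
    import yfinance as yf

    history = yf.Ticker(ticker).history(period=period)  # pandas DataFrame
    # Build the frame from plain Python lists to avoid timezone surprises.
    return pl.DataFrame(
        {
            "Date": [ts.date() for ts in history.index],
            "Close": history["Close"].tolist(),
            "Volume": history["Volume"].tolist(),
        }
    )


def process_data(df):
    """Task 2: rename columns, drop rows without a close, sort by date."""
    import polars as pl

    renamed = df.rename({"Date": "date", "Close": "close", "Volume": "volume"})
    return renamed.filter(pl.col("close").is_not_null()).sort("date")


def save_data(df, path: Path) -> Path:
    """Task 3: write the processed frame to a .parquet file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(path)
    return path


def run_pipeline(ticker: str = TICKER) -> Path:
    """Wrap the three steps as Prefect tasks and run them inside one flow."""
    from prefect import flow, task

    fetch = task(retries=2, retry_delay_seconds=5)(fetch_data)  # assumed retry policy
    process = task(process_data)
    save = task(save_data)

    @flow(name="dji-pipeline-sketch")  # hypothetical flow name
    def pipeline() -> Path:
        raw = fetch(ticker)
        clean = process(raw)
        return save(clean, output_path(ticker))

    return pipeline()
```

Calling `run_pipeline()` needs network access to Yahoo Finance. Keeping the step functions free of Prefect decorators means each one can also be called and tested on its own.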

III. Installation

Clone the repository:

git clone https://github.com/NolanMM/Perfect_v3.0_Data_Engineering_Pipeline.git
cd Perfect_v3.0_Data_Engineering_Pipeline

Create and activate a virtual environment:

python -m venv venv

# Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

Install the required dependencies:
```
pip install -r requirements.txt
```

IV. Usage

1. Activate the Virtual Environment Open a separate terminal and activate the Python virtual environment:

# On Linux: source venv/bin/activate  
# On Windows: venv\Scripts\activate

2. Start Prefect Server In the same terminal, start the Prefect server to enable orchestration and monitoring:

prefect server start

This will launch the Prefect server locally. By default:

The server UI is accessible at http://127.0.0.1:4200/dashboard
Keep this terminal running during the execution of the pipeline.
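
Before running the pipeline, it can help to confirm that the server is actually reachable. The sketch below uses only the standard library and assumes the server exposes a health endpoint at `/api/health` on the default address shown above; the function name `prefect_server_is_up` is made up for this example.

```python
import urllib.error
import urllib.request


def prefect_server_is_up(
    api_url: str = "http://127.0.0.1:4200/api", timeout: float = 2.0
) -> bool:
    """Return True if GET <api_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{api_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

If this returns False, check that `prefect server start` is still running in the other terminal.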

3. Run the Pipeline Open a new terminal, navigate to the project directory, activate the virtual environment,

# On Linux: source venv/bin/activate  
# On Windows: venv\Scripts\activate

and execute the pipeline:

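# On Windows (PowerShell):
$env:PREFECT_API_URL="http://127.0.0.1:4200/api"; python Simple_Data_Pipeline.py

# On Linux:
PREFECT_API_URL="http://127.0.0.1:4200/api" python Simple_Data_Pipeline.py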
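
The `$env:` prefix is PowerShell syntax. As a shell-independent alternative, a small launcher can set `PREFECT_API_URL` for the child process only. This is a sketch under assumptions: the helper names `pipeline_env` and `run_pipeline_script` are hypothetical, and the launcher is expected to be run from the project directory with the virtual environment active.

```python
import os
import subprocess
import sys


def pipeline_env(api_url: str = "http://127.0.0.1:4200/api", base=None) -> dict:
    """Copy the given (or current) environment and point it at the local server."""
    env = dict(os.environ if base is None else base)
    env["PREFECT_API_URL"] = api_url
    return env


def run_pipeline_script(script: str = "Simple_Data_Pipeline.py") -> int:
    """Run the pipeline script with the same interpreter; return its exit code."""
    completed = subprocess.run([sys.executable, script], env=pipeline_env(), check=False)
    return completed.returncode
```

Calling `run_pipeline_script()` behaves the same on Linux and Windows because the variable is passed directly to the child process instead of through shell syntax.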

4. Monitor Pipeline Execution Visit the Prefect server UI in your browser (http://127.0.0.1:4200/dashboard) to monitor task execution, inspect logs, and troubleshoot any issues.

V. Features

Data Retrieval: Fetch historical data for ^DJI using yfinance.
Data Processing: Perform basic data cleaning and processing using polars.
Data Storage: Save the processed data to a .parquet file.
Orchestration: Use Prefect v3.0 to orchestrate the data pipeline.

VI. Requirements

Python 3.9+ (Prefect v3.0 does not support Python 3.8)
Libraries:
- yfinance
- polars
- prefect (v3.0 or higher)
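
To check which of these libraries are present in the active environment, a short standard-library script can read the installed versions. The function name `installed_versions` is made up for this sketch; it reports `None` for any package that is not installed.

```python
from importlib import metadata


def installed_versions(packages=("yfinance", "polars", "prefect")) -> dict:
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing_packages(packages=("yfinance", "polars", "prefect")) -> list:
    """List the packages that are not installed."""
    return [name for name, ver in installed_versions(packages).items() if ver is None]
```

Running `print(installed_versions())` after `pip install -r requirements.txt` should show a version string for each of the three libraries.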

