Skip to content

In this project we will explore data movement, quality, and storage using opensource tools (Mage, Polars, and DuckDB)

License

Notifications You must be signed in to change notification settings

jospablo777/mage_duckdb_pipeline

Repository files navigation

SODA to DuckDB: building a data pipeline for high-performance analytics with modern tools

This project demonstrates how to build an efficient data pipeline to move data from the Socrata Open Data API (SODA) to a local DuckDB database. The goal is to create a lightweight, high-performance data repository optimized for running analytical queries on your local machine. Ideal for data analysts and engineers seeking efficient, lightweight solutions for managing and analyzing large datasets locally.

You can revisit this in-depth tutorial/article to build and understand this system.

Key Features

  • Data Ingestion: retrieve datasets effortlessly from the SODA API for efficient local analytics.
  • High-Performance Analytics: leverage the power of DuckDB to enable high-speed querying and seamless analytics.
  • Modern Tooling: utilize open-source tools like Mage, Polars, and DuckDB to streamline the ETL process.

Tools and Technologies

  • Mage: orchestrates data pipelines for efficient and automated workflows.
  • Polars: accelerates data manipulation with its high-performance DataFrame capabilities.
  • DuckDB: empowers SQL-based OLAP analysis directly on your local machine.

Project Workflow

  1. Data Retrieval: fetch data from the Socrata Open Data API (SODA) using Python.
  2. Data Transformation: use Polars to manipulate and transform the data before loading it into DuckDB.
  3. Data Loading: store the resulting data into DuckDB for high-performance analytics.
  4. Analytics: perform SQL queries on the DuckDB database to generate insights.

Project structure

mage_duckdb_pipeline/
├── docs/            # Website with the tutorial
├── communication/   # Quarto, source code of the tutorial article
├── transformers/    # Mage, transformer blocks are stored here
├── data_loaders/    # Mage, data loader blocks are stored here
├── custom/          # Mage, custom blocks are stored here
├── data_exporters/  # Mage, data exporter blocks are stored here
├── pipelines/       # Mage, our pipelines are stored here
│   └── socrata_iowa_liquor_pipeline/
├── data/            # Directory for storing data, including the .duckdb file
├── metadata.yaml    # Keep record of Mage blocks relationships and activity
├── requirements.txt # Project dependencies
├── .nojekyll
├── dont_forget.sh   # Terminal instructions I don't want to forget
├── LICENSE          # Open source
└── README.md

Installation and Setup

Prerequisites

  • Python 3.12 or later
  • virtualenv (recommended)
  • Linux (I used Ubuntu 24.04)

Clone the Repository

git clone https://github.com/jospablo777/mage_duckdb_pipeline.git
cd mage_duckdb_pipeline

Install Dependencies

Set up a virtual environment and install the required Python packages:

python -m venv venv             # The first 'venv' is the command, the second is the name of the folder for the virtual environment.
source venv/bin/activate        # Activate the virtual environment.
pip install -r requirements.txt # Install dependencies from the requirements file.

Usage

Running the Pipeline

Run it directly from the terminal. To fetch some data, in the project folder run:

mage run . socrata_iowa_liquor_pipeline

To start the Mage UI, run in the terminal (in the project's folder):

mage start

Use the Mage UI to visually manage and monitor your pipeline activities

Citation

No need to cite, but it would mean a lot if you did! 😃 Feel free to use this code and project structure in your personal or work projects—make it yours!

Contact

Have questions, suggestions, or just want to connect? Feel free to reach out!

About

In this project we will explore data movement, quality, and storage using opensource tools (Mage, Polars, and DuckDB)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published