A predictive analytics platform designed to transform retail inventory management through intelligent sales forecasting. SmartStock-Analytics leverages machine learning to predict demand for 50 products across 10 store locations, empowering retailers to strategically manage stock levels, allocate resources efficiently, and minimize inventory costs.

rkb32/Azure-Databricks-ETL-with-Delta-Lake

Databricks ETL Pipeline - PySpark and Delta Lake

This project demonstrates the design and implementation of a modern Lakehouse architecture on Azure, built for scalable, reliable, and automated data processing.

The pipeline ingests raw data, applies multi-layered transformations (Bronze → Silver → Gold), and produces curated Delta tables for analytics and reporting. Orchestration is handled via Azure Data Factory (ADF) and Databricks Jobs, with built-in monitoring, notifications, and governance.

Unlike a typical academic project, this one follows real-world enterprise practices: ingestion pipelines, orchestration, IAM-based security, incremental transformations, and enriched, analytics-ready data.

Why this project matters: it demonstrates Data Engineering practices that I can bring directly to your engineering team.

Tech Stack

| Layer | Technology Used |
| --- | --- |
| Storage | Azure Data Lake Storage Gen2 (ADLS) |
| Compute/ETL | Azure Databricks (PySpark, Delta Lake) |
| Orchestration | Azure Data Factory (ADF Pipelines), Databricks Jobs |
| Data Formats | Delta, Parquet |
| Governance | Azure IAM (RBAC, token-based access), Resource Templates |
| Analytics | Power BI (optional downstream for visualization) |

Data Flow

  • Bronze Layer (Raw Ingest)

    • Extracts the dataset automatically via the Kaggle API
    • Stores raw data exactly as ingested from the source
    • Applies a schema but no business logic
  • Silver Layer (Curated Clean Data)

    • PySpark transformations:
      • Null handling
      • Data type casting
      • Removal of unwanted characters (regex cleaning)
      • Consistent schema enforcement
  • Gold Layer (Business-Ready Data)

    • Added business-derived features:
      • Movie Era → (Old, Middle, Recent)
      • Rating Category → (Highly Rated, Good, Moderate, Low)
      • Popularity → (Most Popular, Moderate, Low)
    • Stored as Delta tables and Parquet for downstream analytics
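The Silver cleaning and Gold feature steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks from this repository: the column names (`release_year`, `rating`), the cut-off years, and the rating thresholds are all assumptions chosen for the example. The helpers are plain Python so they can be checked without a cluster; `to_gold` wraps them as PySpark UDFs and needs `pyspark` only when it is called.

```python
import re
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Silver-style regex cleaning: keep ASCII letters, digits and spaces."""
    if value is None:
        return None
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", value)
    # Collapse runs of whitespace left behind by the removal above.
    return re.sub(r"\s+", " ", cleaned).strip()


def movie_era(year: Optional[int]) -> Optional[str]:
    """Map a release year to an era. The cut-off years are hypothetical."""
    if year is None:
        return None
    if year < 1980:
        return "Old"
    if year < 2010:
        return "Middle"
    return "Recent"


def rating_category(rating: Optional[float]) -> Optional[str]:
    """Map a numeric rating to a label. The thresholds are hypothetical."""
    if rating is None:
        return None
    if rating >= 8.0:
        return "Highly Rated"
    if rating >= 6.5:
        return "Good"
    if rating >= 5.0:
        return "Moderate"
    return "Low"


def to_gold(silver_df):
    """Add the derived columns to a PySpark DataFrame.

    Assumes hypothetical columns `release_year` and `rating` exist.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    era_udf = F.udf(movie_era, StringType())
    rating_udf = F.udf(rating_category, StringType())
    return (
        silver_df
        .withColumn("movie_era", era_udf(F.col("release_year").cast("int")))
        .withColumn("rating_category", rating_udf(F.col("rating").cast("double")))
    )
```

UDFs keep the thresholds in one place for this sketch; on large tables, the same rules expressed with `F.when(...).otherwise(...)` would usually run faster because they avoid Python serialization.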

Orchestration

  • Azure Data Factory (ADF):

    • Executes Databricks notebooks sequentially with success conditions
    • Scheduled weekly pipeline runs
  • Databricks Jobs:

    • Runs daily automated jobs
    • Configured with failure email notifications
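As a rough illustration of the orchestration described above, the snippet below builds an ADF pipeline definition in which each Databricks notebook activity depends on the previous one with a `Succeeded` condition, plus a partial Databricks job definition with a daily schedule and a failure email. The pipeline name, notebook paths, linked service name, job name, and email address are placeholders, not values from this repository. The weekly ADF trigger and the job's cluster settings are not shown.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical name of the ADF linked service that points at the workspace.
LINKED_SERVICE = "AzureDatabricksLinkedService"


def notebook_activity(name: str, notebook_path: str,
                      depends_on: Optional[str] = None) -> dict:
    """Build one ADF DatabricksNotebook activity."""
    depends = []
    if depends_on is not None:
        # Run only if the previous activity finished successfully.
        depends.append({"activity": depends_on,
                        "dependencyConditions": ["Succeeded"]})
    return {
        "name": name,
        "type": "DatabricksNotebook",
        "dependsOn": depends,
        "typeProperties": {"notebookPath": notebook_path},
        "linkedServiceName": {
            "referenceName": LINKED_SERVICE,
            "type": "LinkedServiceReference",
        },
    }


def build_pipeline(name: str, notebooks: List[Tuple[str, str]]) -> dict:
    """Chain the notebooks so each one waits for the one before it."""
    activities = []
    previous = None
    for activity_name, path in notebooks:
        activities.append(notebook_activity(activity_name, path, previous))
        previous = activity_name
    return {"name": name, "properties": {"activities": activities}}


# Placeholder names and paths for the three layers.
pipeline = build_pipeline("movies_etl", [
    ("bronze", "/etl/bronze_ingest"),
    ("silver", "/etl/silver_clean"),
    ("gold", "/etl/gold_features"),
])

# Partial Databricks Jobs API payload: daily at 06:00 UTC, email on failure.
job_settings = {
    "name": "movies_etl_daily",
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?",
                 "timezone_id": "UTC"},
    "email_notifications": {"on_failure": ["data-team@example.com"]},
}

if __name__ == "__main__":
    print(json.dumps(pipeline, indent=2))
```

Generating the definition from a list keeps the Bronze → Silver → Gold ordering in one place, so adding a layer only means adding one tuple.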

Security & Governance

  • Secrets Management: Stored credentials in Azure Key Vault; accessed securely in Databricks via Secret Scopes.
  • Authentication: Used OAuth 2.0 with a Service Principal; IAM roles are assigned with least-privilege access (Storage Blob Data Contributor).
  • Secure Data Access: Mounted ADLS Gen2 in Databricks without exposing keys.
  • Governance: All resources deployed via ARM Templates; notebooks & pipelines exported for auditability.
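A mount along the lines described above might look like the sketch below. It assumes a Key Vault-backed secret scope; the secret key names (`sp-client-id`, `sp-client-secret`, `sp-tenant-id`) and all function parameters are hypothetical. `dbutils` is passed in explicitly because it only exists inside a Databricks notebook or job.

```python
def oauth_mount_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Hadoop ABFS settings for OAuth 2.0 client-credentials access."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_uri(container: str, account: str) -> str:
    """Build the ADLS Gen2 URI for a container in a storage account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"


def mount_adls(dbutils, container: str, account: str,
               mount_point: str, scope: str) -> None:
    """Mount a container using secrets read from a secret scope.

    The secret key names below are placeholders for this example.
    """
    configs = oauth_mount_configs(
        client_id=dbutils.secrets.get(scope=scope, key="sp-client-id"),
        client_secret=dbutils.secrets.get(scope=scope, key="sp-client-secret"),
        tenant_id=dbutils.secrets.get(scope=scope, key="sp-tenant-id"),
    )
    dbutils.fs.mount(
        source=abfss_uri(container, account),
        mount_point=mount_point,
        extra_configs=configs,
    )
```

In a notebook this would be called as, for example, `mount_adls(dbutils, "bronze", "mystorageaccount", "/mnt/bronze", "kv-scope")`, where every argument after `dbutils` is a placeholder. No key or secret value appears in the notebook source.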

Outputs

  • Cleaned & enriched data in Delta format (Silver, Gold)
  • Delta Tables registered in Databricks
  • Gold layer Parquet exported for BI/analytics
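One possible way to produce these outputs is sketched below: write the Gold DataFrame as Delta, register a table over that location, and export a Parquet copy. The table name and paths are assumptions supplied by the caller, and overwrite mode is used only to keep the example short; the actual notebooks may write incrementally.

```python
def create_table_sql(table_name: str, delta_path: str) -> str:
    """SQL that registers an existing Delta location as a table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{delta_path}'"
    )


def publish_gold(spark, gold_df, table_name: str,
                 delta_path: str, parquet_path: str) -> None:
    """Write Delta, register the table, and export Parquet for BI tools.

    `spark` is the active SparkSession and `gold_df` a PySpark DataFrame.
    """
    # Delta copy used by the registered table.
    gold_df.write.format("delta").mode("overwrite").save(delta_path)
    spark.sql(create_table_sql(table_name, delta_path))
    # Plain Parquet copy for downstream consumers that do not read Delta.
    gold_df.write.mode("overwrite").parquet(parquet_path)
```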

Key Highlights

  • Self-initiated industry-style project, not coursework
  • PySpark-based scalable transformations on Databricks
  • Real-world Lakehouse design (Bronze → Silver → Gold)
  • Automated with ADF pipelines + Databricks Jobs
  • Secure & production-ready with IAM roles
  • Value-added business features for analytics readiness
