Retail-Data-Engineering-Pipeline: Engineering Insights from 5M+ Transactions

High-Throughput ETL | Distributed Cloud Computing | Ethical Big Data

📖 The Narrative: Scaling Business Intelligence

In enterprise retail, the challenge isn't just having data; it's the speed at which you can turn 5 million raw transactions into a roadmap for growth. This project tells the story of architecting a production-grade ETL pipeline on the Google Cloud Platform (GCP), designed to handle high-velocity data and extract complex financial KPIs across global markets while maintaining a rigorous framework for data ethics.


🏗️ Chapter 1: The Ingestion Layer (GCS to Spark)

Processing 5 million records requires more than a script; it requires a cloud-native ecosystem capable of horizontal scaling.

  • Storage Orchestration: Managed the full storage lifecycle using Google Cloud Storage (GCS), ensuring low-latency data availability for the compute cluster.
  • Distributed Compute: Provisioned and managed GCP Dataproc clusters to perform parallelized transformations, significantly reducing processing time compared to local execution.
  • Data Flow: Engineered the bridge from raw CSV assets in GCS to active Spark RDDs/DataFrames for high-speed manipulation (see the ingestion sketch below).
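
A rough sketch of this read path is shown below; the bucket layout and file name (raw/retail_transactions.csv) are illustrative placeholders, not the repository's actual paths. On Dataproc the GCS connector is preinstalled, so gs:// paths resolve directly.

from pyspark.sql import SparkSession

# Spark session is provided by the Dataproc runtime
spark = SparkSession.builder.appName("retail-ingestion").getOrCreate()

# Load the raw transactions from GCS into a DataFrame
raw_df = (
    spark.read
         .option("header", True)       # first row carries column names
         .option("inferSchema", True)  # infer numeric / date types
         .csv("gs://[YOUR_BUCKET]/raw/retail_transactions.csv")
)

raw_df.printSchema()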

⚙️ Chapter 2: Performance Engineering & ETL

To extract value from a high-velocity dataset, I engineered a multi-stage transformation pipeline focusing on resource optimization and analytical depth.

  • Aggregate Logic: Developed Spark jobs to calculate global salary averages, gender distributions, and geographic purchasing power.
  • Optimization Strategy: Utilized Spark's Lazy Evaluation to optimize the execution plan and implemented coalesce() to manage output file counts, ensuring the final "Single CSV" report was consolidated without risking driver-node failure (see the sketch after this list).
  • Feature Engineering: Normalized spatial and demographic data (Age, Sex, Country) to provide a unified view of the retail landscape.
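
A minimal sketch of this stage, continuing from the raw_df DataFrame loaded in the ingestion sketch and assuming column names such as Country and Salary (the real schema may differ): the groupBy/agg calls stay lazy until the write action triggers execution, and coalesce(1) keeps the output to a single file.

from pyspark.sql import functions as F

# Transformations are lazy: Spark only builds the execution plan here
country_stats = (
    raw_df.groupBy("Country")                        # assumed column name
          .agg(F.avg("Salary").alias("avg_salary"),  # assumed column name
               F.count("*").alias("customers"))
)

# coalesce(1) shrinks the result to one partition, so the write action
# produces a single CSV part file instead of many shards
(country_stats.coalesce(1)
              .write.mode("overwrite")
              .option("header", True)
              .csv("gs://[YOUR_BUCKET]/output/avg_salary_by_country"))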

⚖️ Chapter 3: The Ethical Framework (Case Study)

Data engineering is as much about Fairness as it is about Features. I integrated an ethical audit into this pipeline, inspired by the "Big Retail Corp" scenario.

  • Transparency vs. Privacy: Evaluated the trade-offs between hyper-personalization and constant background tracking.
  • Vulnerability Analysis: Investigated how predictive targeting can avoid exploiting financially stressed or vulnerable customer segments.
  • Consent Architecture: Advocated for moving away from opaque Terms & Conditions toward readable, simplified disclosures and opt-out dashboards.

📊 Business Intelligence Insights

The pipeline successfully extracted the following KPIs from the 5,000,000+ record dataset:

KPI | Strategic Impact
Avg Salary by Country | Identified high-value regional markets for premium product targeting.
Gender & Age Distribution | Balanced demographic inventory based on maturity and segment potential.
Total Spend by Region | Prioritized logistics and marketing budget for high-engagement zones.
Customer Density | Found top-performing markets to optimize supply chain footprints.
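
For illustration only, KPIs like these reduce to groupBy aggregations over the loaded DataFrame; the sketch below assumes hypothetical column names (Sex, Age, Country, PurchaseAmount) rather than the dataset's actual schema.

from pyspark.sql import functions as F

# Gender and age distribution (Sex and Age are assumed column names)
demographics = (
    raw_df.groupBy("Sex")
          .agg(F.count("*").alias("customers"),
               F.avg("Age").alias("avg_age"))
)

# Total spend per country to rank high-engagement regions
# (PurchaseAmount is an assumed column name)
spend_by_region = (
    raw_df.groupBy("Country")
          .agg(F.sum("PurchaseAmount").alias("total_spend"))
          .orderBy(F.desc("total_spend"))
)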

🚀 Deployment & Cloud Execution

Provision and run this pipeline on a distributed cluster using the gcloud CLI:

1. Provision the Cluster

gcloud dataproc clusters create retail-analytics-cluster \
    --region=europe-west3 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2 \
    --num-workers=2

2. Submit the ETL Job

# For a combined single CSV output:
gcloud dataproc jobs submit pyspark gs://[YOUR_BUCKET]/scripts/retail_analysis_combined.py \
  --cluster=retail-analytics-cluster --region=europe-west3
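
3. (Optional) Tear Down the Cluster

Once the job finishes, deleting the cluster avoids idle billing:

gcloud dataproc clusters delete retail-analytics-cluster \
    --region=europe-west3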

👨‍💻 Project Stewardship

Maintained by shreyamalogi.