In enterprise retail, the challenge isn't just having data; it's the speed at which you can turn 5 million raw transactions into a roadmap for growth. This project tells the story of architecting a production-grade ETL pipeline on the Google Cloud Platform (GCP), designed to handle high-velocity data and extract complex financial KPIs across global markets while maintaining a rigorous framework for data ethics.
Processing 5 million records requires more than a script; it requires a cloud-native ecosystem capable of horizontal scaling.
- Storage Orchestration: Managed the full storage lifecycle using Google Cloud Storage (GCS), ensuring low-latency data availability for the compute cluster.
- Distributed Compute: Provisioned and managed GCP Dataproc clusters to perform parallelized transformations, significantly reducing processing time compared to local execution.
- Data Flow: Engineered the bridge from raw CSV assets in GCS to active Spark RDDs/DataFrames for high-speed manipulation (a minimal loading sketch follows below).
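A minimal sketch of that loading step. The bucket path, file name, and schema options below are illustrative assumptions, not the project's actual layout:

```python
from pyspark.sql import SparkSession

# Minimal loading sketch: the gs:// path below is a placeholder assumption.
spark = SparkSession.builder.appName("retail-etl-load").getOrCreate()

raw_df = (
    spark.read
    .option("header", "true")        # first row holds column names
    .option("inferSchema", "true")   # let Spark derive numeric types
    .csv("gs://YOUR_BUCKET/raw/retail_5m.csv")
)

raw_df.printSchema()
print(f"Loaded {raw_df.count():,} rows")  # expect roughly 5,000,000 records
```

On Dataproc the Cloud Storage connector is preinstalled, so `gs://` paths are readable by Spark without extra configuration.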
To extract value from a high-velocity dataset, I engineered a multi-stage transformation pipeline focusing on resource optimization and analytical depth.
- Aggregate Logic: Developed Spark jobs to calculate global salary averages, gender distributions, and geographic purchasing power.
- Optimization Strategy: Leveraged Spark's lazy evaluation to optimize the execution plan and applied `coalesce()` to control output file counts, ensuring the final "Single CSV" report was consolidated without risking driver-node failure (see the sketch after this list).
- Feature Engineering: Normalized geographic and demographic data (Age, Sex, Country) to provide a unified view of the retail landscape.
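A sketch of the aggregation and consolidation step, assuming hypothetical column names (`country`, `salary`) and output paths. The `groupBy`/`avg` chain stays lazy until the write action triggers it, and `coalesce(1)` merges the small aggregated result into one partition so a single CSV part file is written by an executor rather than funnelled through the driver:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("retail-etl-transform").getOrCreate()

# Assumed input path and column names, mirroring the loading sketch above.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("gs://YOUR_BUCKET/raw/retail_5m.csv")
)

# Lazy transformations: nothing executes until the write action below.
avg_salary_by_country = (
    df.groupBy("country")
      .agg(F.avg("salary").alias("avg_salary"))
      .orderBy(F.desc("avg_salary"))
)

# coalesce(1) collapses the aggregated result into one partition,
# so the output directory contains exactly one CSV part file.
(
    avg_salary_by_country
    .coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv("gs://YOUR_BUCKET/output/avg_salary_by_country")
)
```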
Data engineering is as much about Fairness as it is about Features. I integrated an ethical audit into this pipeline, inspired by the "Big Retail Corp" scenario.
- Transparency vs. Privacy: Evaluated the trade-offs between hyper-personalization and constant background tracking.
- Vulnerability Analysis: Examined safeguards to prevent predictive targeting from exploiting financially stressed or vulnerable customer segments.
- Consent Architecture: Advocated for moving away from opaque Terms & Conditions toward readable, simplified disclosures and opt-out dashboards.
The pipeline successfully extracted the following KPIs from the 5,000,000+ record dataset:
| KPI | Strategic Impact |
|---|---|
| Avg Salary by Country | Identified high-value regional markets for premium product targeting. |
| Gender & Age Distribution | Informed inventory balancing across age and gender segments according to maturity and segment potential. |
| Total Spend by Region | Prioritized logistics and marketing budget for high-engagement zones. |
| Customer Density | Found top-performing markets to optimize supply chain footprints. |
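The remaining KPIs follow the same aggregation pattern. A hedged sketch, again assuming hypothetical column names (`sex`, `age`, `country`, `total_spend`, `customer_id`) rather than the dataset's exact schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("retail-etl-kpis").getOrCreate()
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("gs://YOUR_BUCKET/raw/retail_5m.csv")  # assumed path, as in the sketches above
)

# Gender & age distribution across the customer base.
gender_age = df.groupBy("sex", "age").count()

# Total spend per country/region, highest-engagement zones first.
spend_by_region = (
    df.groupBy("country")
      .agg(F.sum("total_spend").alias("total_spend"))
      .orderBy(F.desc("total_spend"))
)

# Customer density: distinct customers per market.
density = (
    df.groupBy("country")
      .agg(F.countDistinct("customer_id").alias("customers"))
      .orderBy(F.desc("customers"))
)

for kpi in (gender_age, spend_by_region, density):
    kpi.show(10, truncate=False)
```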
Provision and run this pipeline on a distributed cluster using the GCP CLI:
```bash
gcloud dataproc clusters create retail-analytics-cluster \
    --region=europe-west3 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2 \
    --num-workers=2

# For a combined single CSV output:
gcloud dataproc jobs submit pyspark gs://[YOUR_BUCKET]/scripts/retail_analysis_combined.py \
    --cluster=retail-analytics-cluster \
    --region=europe-west3
```
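Once the job completes, the cluster can be torn down with `gcloud dataproc clusters delete retail-analytics-cluster --region=europe-west3` to avoid paying for idle compute.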
- Lead Developer: Shreya Malogi (Founder @ Codemacrocosm)
- Dataset Source: Retail Store 5M Dataset
- Status: Production-ready Data Engineering Proof-of-Concept.