This project builds an automated ETL pipeline that collects weekly Top 50 Global song data from the Topsify channel on Spotify through the Spotify API and stores it in Snowflake using AWS services. The pipeline is designed for a client who wants to track global music trends over time and use those insights for data-driven content creation in the music industry.
By collecting data every week over a year, the client will be able to uncover patterns related to trending artists, genres, and albums. This will help them understand what tends to make a song successful and make data-informed decisions when creating new music content.
- Source: Spotify Top 50 Global Playlist (Topsify)
- Trigger: Weekly via Amazon CloudWatch
- Lambda: Python-based function uses the Spotify API to extract the current playlist data and transforms it into JSON format
- Raw Data Storage: Stored in Amazon S3 as JSON
- Trigger: S3 Object PUT triggers the transformation
- AWS Glue: Spark job performs data cleaning and transformation on raw JSON
- Output: Transformed data stored back into Amazon S3 as CSV
- Snowpipe: Automatically ingests the transformed data from S3
- Snowflake: Stores structured and queryable song data for downstream analysis
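
The extraction step in the flow above can be sketched in plain Python. This is a minimal illustration, not the project's actual Lambda code: it assumes the playlist items follow the shape returned by the Spotify Web API (each item has a `track` with `id`, `name`, `artists`, `album`, and `popularity`). The function names, the record fields, and the `raw/top50_global/` key prefix are hypothetical. The API call and the S3 upload are left out so the sketch runs without credentials.

```python
import json
from datetime import date


def flatten_playlist_item(item: dict, position: int) -> dict:
    """Reduce one playlist item to the fields kept in the raw JSON file."""
    track = item["track"]
    return {
        "position": position,
        "track_id": track["id"],
        "track_name": track["name"],
        "artists": [artist["name"] for artist in track["artists"]],
        "album": track["album"]["name"],
        "popularity": track.get("popularity"),
    }


def build_raw_payload(items: list, snapshot_date: date) -> dict:
    """Build the weekly snapshot; items whose track is missing are skipped."""
    tracks = [
        flatten_playlist_item(item, position)
        for position, item in enumerate(items, start=1)
        if item.get("track")
    ]
    return {"snapshot_date": snapshot_date.isoformat(), "tracks": tracks}


def raw_object_key(snapshot_date: date) -> str:
    """Hypothetical S3 key layout: one JSON object per ISO week."""
    iso_year, iso_week = snapshot_date.isocalendar()[:2]
    return f"raw/top50_global/{iso_year}-W{iso_week:02d}.json"


if __name__ == "__main__":
    sample_items = [
        {
            "track": {
                "id": "t1",
                "name": "Song A",
                "artists": [{"name": "Artist X"}],
                "album": {"name": "Album Z"},
                "popularity": 90,
            }
        }
    ]
    payload = build_raw_payload(sample_items, date(2024, 1, 1))
    print(raw_object_key(date(2024, 1, 1)))
    print(json.dumps(payload, indent=2))
```

Keying the raw object by ISO week keeps one file per weekly run, which makes a year of snapshots easy to list and reprocess.
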
- Spotify API (Data Source)
- AWS Lambda (ETL Trigger + Extraction in Python)
- Amazon CloudWatch (Trigger for Lambda)
- Amazon S3 (Raw and Transformed Data Storage)
- AWS Glue (Apache Spark-based Transformation)
- Snowpipe & Snowflake (Data Warehouse & Auto-loading)
- Fully serverless and scalable architecture
- Collects weekly updates without manual intervention
- Enables year-long data accumulation for rich analytics
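
The JSON-to-CSV transformation that the Glue job performs can be illustrated with a small stand-in. This sketch uses Python's standard `csv` module rather than Spark, so it shows the shape of the cleaning step, not the actual Glue script. The column names, the deduplication rule, and the `"; "` separator for multiple artists are assumptions made for the example.

```python
import csv
import io

# Assumed column order for the transformed file (hypothetical).
COLUMNS = [
    "snapshot_date",
    "position",
    "track_id",
    "track_name",
    "artists",
    "album",
    "popularity",
]


def records_to_csv(records: list, snapshot_date: str) -> str:
    """Clean raw track records and render them as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()

    seen_ids = set()
    for record in records:
        track_id = record.get("track_id")
        # Drop rows without an id and repeated tracks within one snapshot.
        if not track_id or track_id in seen_ids:
            continue
        seen_ids.add(track_id)
        writer.writerow(
            {
                "snapshot_date": snapshot_date,
                "position": record["position"],
                "track_id": track_id,
                "track_name": record["track_name"].strip(),
                # Flatten the artist list into a single CSV cell.
                "artists": "; ".join(record["artists"]),
                "album": record["album"].strip(),
                "popularity": record.get("popularity"),
            }
        )
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {
            "position": 1,
            "track_id": "t1",
            "track_name": " Song A ",
            "artists": ["Artist X", "Artist Y"],
            "album": "Album Z",
            "popularity": 90,
        }
    ]
    print(records_to_csv(sample, "2024-01-01"))
```

Carrying `snapshot_date` on every row is what lets the weekly files be stacked in one Snowflake table and compared across the year.
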