This project develops a machine learning pipeline that predicts song popularity from audio features and metadata retrieved through the Spotify Web API. The foundational implementation uses Random Forest regression to generate popularity scores and rank songs by those predictions. The project was later extended into a comparative study of three regression algorithms (Random Forest, XGBoost, and LightGBM) to determine which one most effectively predicts track success.
- Retrieves track metadata and audio features from the Spotify Web API using Python batch processing.
- Preprocesses the data and engineers features for modeling.
- Predicts popularity with a Random Forest Regressor (R² of 0.99 in the reported evaluation).
- Ranks songs by predicted popularity score.
- Python - Primary programming language
- Pandas - Data manipulation and preprocessing
- Spotify Web API - Real-time data integration
- scikit-learn - Random Forest & metrics
- LightGBM - Gradient boosting implementation
- XGBoost - Advanced gradient boosting
- Matplotlib / Seaborn - Data visualization
To run this project, you will need the following hardware and software.
Hardware Requirements:
- Primary Memory: 8.00 GB
- Secondary Memory: 1 TB
- Processor: 10th Generation Intel Core i5
Software Requirements:
- Python: Version 3.7
- Google Colab, PyCharm, or any preferred integrated development environment.
- Libraries: spotipy, Pandas, NumPy, scikit-learn, XGBoost, LightGBM, Matplotlib, Seaborn, plus `csv` and `time` from the Python standard library
- Integrated with Spotify Web API to retrieve song metadata
- Extracted 10+ audio features per track (energy, danceability, acousticness, etc.)
- Implemented Python batch processing for efficient large-scale data retrieval
Steps:
- Create an application in the Spotify Developer Dashboard
- Obtain the Client ID and Client Secret
- Run the Python code to retrieve data from the Spotify Web API
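
The retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the helper names (`chunked`, `make_client`, `fetch_audio_features`) and the `pause` parameter are hypothetical, and the batch size of 100 assumes the per-request ID limit of the audio-features endpoint. Whether your application can call a given endpoint depends on its API access.

```python
import time
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def make_client(client_id: str, client_secret: str):
    """Build a spotipy client with the Client Credentials flow (hypothetical helper)."""
    # Imported lazily so the batching helpers can be used without spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    return spotipy.Spotify(auth_manager=auth)


def fetch_audio_features(client, track_ids: Sequence[str],
                         batch_size: int = 100, pause: float = 0.0) -> List[dict]:
    """Request audio features batch by batch and collect the non-empty results."""
    rows: List[dict] = []
    for batch in chunked(track_ids, batch_size):
        # audio_features returns one entry per requested ID; missing tracks come back as None.
        rows.extend(r for r in client.audio_features(tracks=batch) if r is not None)
        time.sleep(pause)  # optional delay between requests to stay under rate limits
    return rows
```

A typical call would be `fetch_audio_features(make_client(CLIENT_ID, CLIENT_SECRET), track_ids, pause=0.1)`, with the credentials read from environment variables rather than hard-coded.
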
- Cleaned and normalized audio features using Pandas
- Handled missing values and outliers
- Standardized feature scaling for optimal model performance
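
As an illustration of the preprocessing steps above, the sketch below fills missing values, limits outliers, and standardizes the features with Pandas. The specific choices are assumptions, since the README does not state them: median imputation, clipping to the 1.5 × IQR fences, and z-score scaling. The function name `preprocess` and the column names in the usage note are hypothetical.

```python
from typing import List

import pandas as pd


def preprocess(df: pd.DataFrame, feature_cols: List[str]) -> pd.DataFrame:
    """Return a copy of `df` with imputed, clipped, and standardized feature columns."""
    out = df.copy()

    # 1. Missing values: replace NaN with the column median (assumed strategy).
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].median())

    # 2. Outliers: clip each column to its 1.5 * IQR fences (assumed strategy).
    q1 = out[feature_cols].quantile(0.25)
    q3 = out[feature_cols].quantile(0.75)
    iqr = q3 - q1
    out[feature_cols] = out[feature_cols].clip(
        lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1
    )

    # 3. Scaling: zero mean and unit variance per column; constant columns are left at 0.
    mean = out[feature_cols].mean()
    std = out[feature_cols].std(ddof=0).replace(0, 1.0)
    out[feature_cols] = (out[feature_cols] - mean) / std
    return out
```

For example, `preprocess(tracks_df, ["energy", "danceability", "acousticness"])` returns a new frame and leaves `tracks_df` unchanged. In a train/test setup, the statistics would normally be computed on the training split only and then reused for the test split.
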
- Built and trained three different machine learning algorithms
- Optimized hyperparameters for each model
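
One way to train and tune a model is a cross-validated grid search, sketched below with scikit-learn. The README does not say which search method or parameter grids were used, so `tune_regressor`, the grid values, and the 3-fold setting are assumptions for illustration. Because `xgboost.XGBRegressor` and `lightgbm.LGBMRegressor` follow the scikit-learn estimator interface, they could be passed in place of the Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def tune_regressor(estimator, param_grid, X, y, cv=3):
    """Grid-search `param_grid` with cross-validation and return (best_model, best_params)."""
    search = GridSearchCV(
        estimator,
        param_grid,
        scoring="neg_root_mean_squared_error",  # lower RMSE is better
        cv=cv,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_


def tune_random_forest(X, y, seed=42):
    """Tune a Random Forest over a small, illustrative grid (values are assumptions)."""
    grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
    return tune_regressor(RandomForestRegressor(random_state=seed), grid, X, y)
```

With a feature matrix `X_train` and popularity targets `y_train`, `model, params = tune_random_forest(X_train, y_train)` returns the refitted best model, which can then score held-out tracks via `model.predict(X_test)`.
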
- Generates popularity scores (0-100) for any song
- Ranks songs based on predicted popularity
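
The ranking step can be sketched as below. The function name `rank_by_prediction` and the output column names are hypothetical. Clipping predictions to the 0-100 range is an assumption: a regressor's raw output is not guaranteed to stay inside the popularity scale.

```python
from typing import Optional

import numpy as np
import pandas as pd


def rank_by_prediction(tracks: pd.DataFrame, predictions,
                       top_n: Optional[int] = None) -> pd.DataFrame:
    """Attach predicted scores to `tracks` and sort from most to least popular."""
    ranked = tracks.copy()
    # Keep scores on the 0-100 popularity scale (assumed post-processing step).
    ranked["predicted_popularity"] = np.clip(np.asarray(predictions, dtype=float), 0, 100)
    ranked = ranked.sort_values("predicted_popularity", ascending=False).reset_index(drop=True)
    # Tied scores share the same (lowest) rank.
    ranked["rank"] = ranked["predicted_popularity"].rank(method="min", ascending=False).astype(int)
    return ranked if top_n is None else ranked.head(top_n)
```

For instance, `rank_by_prediction(tracks_df, model.predict(X), top_n=10)` would return the ten tracks with the highest predicted popularity.
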
- Scatter plots: True vs Predicted popularity scores
XGB Regressor

LightGBM Regressor

Random Forest Regressor

| Metric | Value |
|---|---|
| R² Score | 0.99 |
| RMSE | 0.22 |
| MAE | 0.16 |
| MSE | 0.05 |
| Metric | Value |
|---|---|
| R² Score | 0.99 |
| RMSE | 0.20 |
| MAE | 0.13 |
| MSE | 0.04 |
| Metric | Value |
|---|---|
| R² Score | 0.99 |
| RMSE | 0.20 |
| MAE | 0.15 |
| MSE | 0.04 |
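
For reference, the four metrics in the tables above can be computed from true and predicted scores as in this sketch. It uses the standard definitions with NumPy only; the function name `regression_metrics` is hypothetical, and the project itself lists scikit-learn for metrics.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute R², RMSE, MAE, and MSE from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))       # mean squared error
    rmse = float(np.sqrt(mse))                 # root of the MSE, in popularity units
    mae = float(np.mean(np.abs(residuals)))    # mean absolute error
    ss_res = float(np.sum(residuals ** 2))     # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot                 # undefined if y_true is constant
    return {"r2": r2, "rmse": rmse, "mae": mae, "mse": mse}
```

Since MSE is the square of RMSE, the reported pairs are consistent with each other: 0.22² ≈ 0.05 and 0.20² = 0.04.
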
- LightGBM stood out for its predictive accuracy.
- All three algorithms performed exceptionally well (R² ≥ 0.99)
Note: Please make sure the required Python libraries and dependencies are installed.
Note: This project is for educational purposes. Please respect Spotify's API terms of service when using this code.

