This project is a Song Genre Classifier built using Apache Spark MLlib and Spring Boot. It classifies songs into predefined genres based on their lyrics.
Course Assignment: This project was developed as part of the coursework for In20-S8-CS4651 - Big Data Analytics, Week 10: Big Data Visualisation, MLlib and Visualisation Homework.
The application provides functionalities to:
- Train a machine learning model (Logistic Regression) using a dataset of song lyrics and their genres.
- Expose a REST API to predict the genre of new song lyrics using the trained model.
The ML pipeline involves several custom Spark transformers for text preprocessing:
- Cleanser: Removes non-alphabetic characters and converts text to lowercase.
- Numerator: Assigns a row number, used internally by other transformers.
- Tokenizer: Splits cleaned lyrics into words.
- StopWordsRemover: Removes common stop words.
- Exploder: Converts an array of words into individual rows, each containing one word.
- Stemmer: Stems each word to its root form (e.g., "running" to "run") using the English Snowball stemmer.
- Uniter: Aggregates stemmed words back into sentences per original lyric entry.
- Verser: Groups sentences into "verses" (configurable number of sentences per verse).
- Word2Vec: Converts verses (sequences of words) into feature vectors.
- LogisticRegression: The classification algorithm.
The pipeline is tuned using CrossValidator.
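To make the early stages concrete, here is a minimal plain-Java sketch of what the Cleanser, Tokenizer, StopWordsRemover, and Verser steps do conceptually. It is not the project's Spark transformer code. The class name, the tiny stop-word list, and the choice to replace non-alphabetic characters with a space (so word boundaries survive) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java illustration only; the project's real stages are Spark ML transformers.
final class PreprocessingSketch {

    // Hypothetical, deliberately tiny stop-word list (the real list is not shown in this README).
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "and", "you", "like");

    // Cleanser-like step: lowercase, then turn every run of non-alphabetic characters into one space.
    static String cleanse(String lyrics) {
        return lyrics.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
    }

    // Tokenizer-like step: split cleaned text on whitespace.
    static List<String> tokenize(String cleaned) {
        String trimmed = cleaned.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // StopWordsRemover-like step: drop tokens found in the stop-word set.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    // Verser-like step: group consecutive sentences into verses of a configurable size.
    static List<List<String>> toVerses(List<String> sentences, int sentencesPerVerse) {
        if (sentencesPerVerse <= 0) {
            throw new IllegalArgumentException("sentencesPerVerse must be positive");
        }
        List<List<String>> verses = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += sentencesPerVerse) {
            int end = Math.min(start + sentencesPerVerse, sentences.size());
            verses.add(new ArrayList<>(sentences.subList(start, end)));
        }
        return verses;
    }

    public static void main(String[] args) {
        String cleaned = cleanse("Love you, baby... like a LOVE song!");
        List<String> tokens = removeStopWords(tokenize(cleaned));
        System.out.println(cleaned);
        System.out.println(tokens);
        System.out.println(toVerses(List.of("s1", "s2", "s3", "s4", "s5"), 2));
    }
}
```

In the real pipeline these steps operate on DataFrame columns, and stemming and Word2Vec follow before the classifier sees any features.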
- Java Development Kit (JDK): Version 17
- Apache Maven: For building the project and managing dependencies.
- Git: For cloning the repository (if applicable).
- (For Windows users): winutils.exe and hadoop.dll correctly set up for Hadoop, or ensure the HADOOP_HOME environment variable is set. The project attempts to configure this via spark.driver.extraJavaOptions=-Dhadoop.home.dir=C:/Hadoop in application.properties, which you might need to adjust for your system.
.
├── pom.xml # Maven Project Object Model
├── models/ # Default directory for saved ML models
├── src/
│ ├── main/
│ │ ├── java/com/lyrics/classifier/
│ │ │ ├── ClassifierApplication.java # Spring Boot main application
│ │ │ ├── column/ # DataFrame column definitions
│ │ │ ├── config/ # Spring and Spark configurations (SparkSession, TrainRunner)
│ │ │ ├── controller/ # REST API controllers (LyricsController)
│ │ │ ├── service/ # Business logic (LyricsService, MLService)
│ │ │ │ ├── lyrics/ # Genre classification specific services
│ │ │ │ │ ├── pipeline/ # ML pipelines (LogisticRegressionPipeline)
│ │ │ │ │ └── transformer/ # Custom Spark ML Transformers
│ │ ├── resources/
│ │ │ ├── application.properties # Application configuration
│ │ │ ├── data/training/ # Expected location for training data
│ │ │ │ └── Merged_dataset1.csv # Training data CSV (needs to be provided)
│ │ │ └── META-INF/
│ └── test/
│ └── java/ # Unit and integration tests
└── README.md # This file
Key configuration settings are in src/main/resources/application.properties:
- lyrics.csv.path: Path to the training data CSV file. Default: src/main/resources/data/training/Merged_dataset1.csv.
- lyrics.model.directory.path: Directory where trained models are saved and loaded from. Default: models.
- mode: Application operating mode.
  - train: The application trains the model upon startup and then exits.
  - serve: (Default) The application starts, loads a pre-trained model (if available), and serves prediction requests via API.
- logging.level.*: Configures logging levels for the application and libraries like Spark.
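The following sketch shows how these keys and their documented defaults fit together, using the JDK's java.util.Properties instead of Spring's own property binding. The class and method names are hypothetical, and the case-insensitive comparison of the mode value is an assumption of the sketch, not confirmed behaviour of the application.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustration of the documented keys and defaults; the application itself relies on Spring.
final class ConfigSketch {

    // Parses properties-file text (same syntax as application.properties).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    static String csvPath(Properties props) {
        return props.getProperty("lyrics.csv.path",
                "src/main/resources/data/training/Merged_dataset1.csv");
    }

    static String modelDirectory(Properties props) {
        return props.getProperty("lyrics.model.directory.path", "models");
    }

    // "serve" is the documented default; ignoring case here is an assumption of this sketch.
    static boolean isTrainMode(Properties props) {
        return "train".equalsIgnoreCase(props.getProperty("mode", "serve").trim());
    }

    public static void main(String[] args) {
        Properties props = parse("mode=train\nlyrics.model.directory.path=out/models\n");
        System.out.println("train mode: " + isTrainMode(props));
        System.out.println("model dir:  " + modelDirectory(props));
        System.out.println("csv path:   " + csvPath(props));
    }
}
```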
- The application expects a CSV file for training, specified by lyrics.csv.path.
- Format: The CSV file must contain at least two columns with headers:
  - lyrics: The song lyrics (text).
  - genre: The genre of the song (e.g., "POP", "ROCK", "JAZZ"). Genre names defined in Genre.java are matched case-insensitively.
- Sample CSV structure:
    lyrics,genre
    "Some pop song lyrics here...",POP
    "Rock and roll lyrics...",ROCK
    "Smooth jazz verses...",JAZZ
- Note: You need to provide this CSV file in the configured path.
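To illustrate the expected row format, the sketch below splits one CSV line with a quoted lyrics field and maps the genre text to an enum without regard to case. The application reads the file through Spark, so this parser is only a stand-in. The nested Genre enum is a hypothetical substitute for the project's Genre.java (only the three genres named above are listed), and falling back to UNKNOWN for unrecognised names is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in for illustration; the project loads the CSV with Spark, not with this parser.
final class CsvRowSketch {

    // Hypothetical substitute for the project's Genre.java.
    enum Genre { POP, ROCK, JAZZ, UNKNOWN }

    // Case-insensitive lookup; returning UNKNOWN for unmatched names is an assumption.
    static Genre genreFromLabel(String label) {
        try {
            return Genre.valueOf(label.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Genre.UNKNOWN;
        }
    }

    // Splits one CSV line, honouring double-quoted fields and "" as an escaped quote.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        List<String> row = splitCsvLine("\"Rock, and roll lyrics...\",rock");
        System.out.println(row.get(0));
        System.out.println(genreFromLabel(row.get(1)));
    }
}
```

Quoting the lyrics column matters because lyrics routinely contain commas, which would otherwise shift the genre into the wrong column.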
To build the project and package it into a JAR file:
    mvn clean package
This will generate a JAR file in the target/ directory (e.g., target/classifier-0.0.1-SNAPSHOT.jar).
To run the unit and integration tests:
    mvn test
There are two primary modes to run the application: Training Mode and Serving Mode.
In this mode, the application will read the data from lyrics.csv.path, train the classification model, save it to lyrics.model.directory.path, and then exit.
Steps:
- Ensure your training data CSV (Merged_dataset1.csv or as configured) is in place.
- Set mode=train in src/main/resources/application.properties.
- Run the application:
    java -jar target/classifier-0.0.1-SNAPSHOT.jar
  Alternatively, using Maven:
    mvn spring-boot:run
- Check the console for training logs and statistics. The trained model will be saved in the models/logreg_custom directory (or as configured).
In this mode, the application loads a previously trained model and exposes an API endpoint for genre prediction.
Steps:
- Ensure a model has been trained and saved (e.g., by running in Training Mode first).
- Set mode=serve (or leave it as default) in src/main/resources/application.properties.
- Run the application:
    java -jar target/classifier-0.0.1-SNAPSHOT.jar
  Alternatively, using Maven:
    mvn spring-boot:run
- The application will start and be ready to serve requests on port 8080 (default Spring Boot port).
The application exposes the following REST API endpoints, accessible by default at http://localhost:8080.
Swagger UI for API documentation is typically available at http://localhost:8080/swagger-ui.html.
- Endpoint: POST /api/train
- Description: Triggers the model training process. Reads data, trains, saves the model, and returns statistics. This can be used to retrain the model while the application is in serve mode.
- Request Body: None
- Response: JSON object with model statistics, e.g.:
    {
      "Best model metric (higher is better)": 0.85, // Example metric value
      "testSetAccuracy": 0.83 // Example accuracy on the test set
    }
- Endpoint: POST /api/predict
- Description: Predicts the genre for the provided song lyrics.
- Request Body: JSON object with lyrics:
    { "lyrics": "Some new song lyrics to classify..." }
- Response: JSON object with the predicted genre and probabilities for each genre:
    {
      "predictedGenre": "POP", // Example
      "probabilities": {
        "pop": 0.75,
        "rock": 0.15,
        "jazz": 0.05
        // ... other genres
      }
    }
- After running in train mode or calling /api/train, check the console logs for messages indicating successful training and model saving.
- Verify that a model directory (e.g., models/logreg_custom) has been created/updated.
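The directory check above can be scripted. The sketch below treats a model as "saved" when the directory exists and contains at least one entry. That heuristic and the class name are assumptions of the sketch, since this README does not describe what the saved model directory contains.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Heuristic check only: "exists and is non-empty" is an assumption, not a validity test.
final class ModelDirCheckSketch {

    static boolean looksSaved(Path modelDir) {
        if (!Files.isDirectory(modelDir)) {
            return false;
        }
        // Files.list must be closed, hence the try-with-resources.
        try (Stream<Path> entries = Files.list(modelDir)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default path taken from the README example; pass another path as the first argument.
        Path modelDir = Path.of(args.length > 0 ? args[0] : "models/logreg_custom");
        System.out.println(modelDir + (looksSaved(modelDir) ? " looks saved" : " is missing or empty"));
    }
}
```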
- Start the application in serve mode.
- Use a tool like curl or Postman to send a POST request to the /api/predict endpoint. Using curl:
    curl -X POST -H "Content-Type: application/json" -d "{\"lyrics\":\"love you baby like a love song\"}" http://localhost:8080/api/predict
- Check the response for the predicted genre.
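The same request as the curl call above can be sent from Java. This is a minimal sketch using the JDK's built-in java.net.http client (Java 11+). The class name, the hand-rolled JSON escaping helper, and the --send flag are illustrative assumptions; the endpoint and body shape are the ones documented in this README.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative client; assumes the service runs at the default http://localhost:8080.
final class PredictClientSketch {

    // Builds {"lyrics":"..."} with minimal JSON string escaping.
    static String toJsonBody(String lyrics) {
        StringBuilder out = new StringBuilder("{\"lyrics\":\"");
        for (char c : lyrics.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append("\"}").toString();
    }

    static HttpRequest buildRequest(String baseUrl, String lyrics) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/predict"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toJsonBody(lyrics), StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = buildRequest("http://localhost:8080", "love you baby like a love song");
        System.out.println(request.method() + " " + request.uri());
        // Only contact the server when explicitly asked, so the sketch runs without it.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

In a real client a JSON library would normally build and parse the bodies; the manual escaping here only keeps the sketch dependency-free.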
- Windows Hadoop Configuration: If you are running on Windows, Spark might require winutils.exe. The application.properties file includes spark.driver.extraJavaOptions=-Dhadoop.home.dir=C:/Hadoop. You may need to adjust C:/Hadoop to your Hadoop binaries' location or ensure HADOOP_HOME is set and winutils.exe is in its bin directory.
- Spark UI: The Spark UI (for monitoring jobs) is disabled by default in SparkConfig.java (.set("spark.ui.enabled", "false") is commented out). If enabled, it usually runs on port 4040.
- Model Persistence: The StringIndexerModel for genres is re-fitted during prepareData(). For robust prediction in a standalone serving environment where the original training CSV might not be available, this indexer model (or its labels) should ideally be saved and loaded along with the main CrossValidatorModel. The current implementation re-reads the CSV and re-fits the indexer if predict() is called on a new instance of the pipeline (e.g., after application restart).
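One way to act on the model-persistence note is to store the indexer's ordered label list next to the model and read it back at serving time. The sketch below shows only that idea with plain file I/O. It is not the project's implementation, and the class name, the one-label-per-line format, and the labels.txt file name are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: persists the ordered genre labels so predictions can be decoded
// without re-reading the training CSV.
final class LabelStoreSketch {

    // Writes one label per line; line order is the index order.
    static void saveLabels(Path file, List<String> labels) {
        try {
            Files.write(file, labels, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String> loadLabels(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps a numeric prediction (a label index expressed as a double) back to its label.
    static String labelFor(List<String> labels, double predictionIndex) {
        int index = (int) predictionIndex;
        if (index < 0 || index >= labels.size()) {
            throw new IllegalArgumentException("No label for index " + index);
        }
        return labels.get(index);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a hypothetical labels.txt kept beside the model.
        Path file = Files.createTempFile("labels", ".txt");
        saveLabels(file, List.of("pop", "rock", "jazz"));
        System.out.println(labelFor(loadLabels(file), 1.0));
        Files.delete(file);
    }
}
```

Spark ML models can also be persisted with their own save and load facilities, which would keep the indexer and classifier in one consistent format; the text file above is just the smallest form of the idea.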