A comprehensive, structured, and reproducible metadata dataset of Bollywood film albums and songs spanning over 90 years (1931–2025).
This dataset is designed for:
- 🎧 Music information retrieval
- 📊 Data analysis & visualization
- 🤖 Machine learning & recommendation systems
- 🧠 Digital humanities & cultural research
- 🏗️ Scalable music databases
| Category | Count |
|---|---|
| 🎞️ Albums | 12,673 |
| 🎵 Songs | 57,005 |
| 📅 Year Range | 1931 – 2025 |
| 📁 Format | CSV |
| 🔑 Identifiers | Deterministic UUIDv5 |
| 🔁 De-duplication | Yes (content-based) |
Detailed schema for the CSV files found in data/raw/.
In the album files (`data/raw/albums/`), each row represents a unique film soundtrack or album.
| Column Name | Type | Description | Example |
|---|---|---|---|
| `album_uuid` | string | Primary Key. Deterministic UUIDv5 generated from title + year. | a3f9... |
| `album_title` | string | Normalized title of the film/album. | Sholay |
| `album_year` | int | Release year of the album. | 1975 |
| `music_director` | string | Primary composer(s) of the album. | R.D. Burman |
| `label` | string | Music label (e.g., T-Series, Saregama), if available. | Polydor |
In the song files (`data/raw/songs/`), each row represents a unique track within an album.
| Column Name | Type | Description | Example |
|---|---|---|---|
| `song_uuid` | string | Primary Key. Deterministic UUIDv5 generated from album + track + title. | c1b2... |
| `album_uuid` | string | Foreign Key. Links to the parent Album. | a3f9... |
| `track_number` | int | Sequential order of the song in the album. | 1 |
| `song_title` | string | Title of the track. | Mehbooba Mehbooba |
| `singers` | string | Comma-separated list of singers. | R.D. Burman |
| `youtube_url` | string | Direct link to the official music video or audio. | https://youtu.be/... |
| `yt_music_urls` | string | Direct link to the official music track. | https://music.youtube.com/watch?v=..... |
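
To illustrate how the two schemas fit together, here is a minimal sketch that joins songs to their parent albums on `album_uuid` using only Python's standard library. The in-memory CSV snippets reuse the example values from the tables above. The column order, the empty URL fields, and the helper name `join_songs_to_albums` are assumptions made for illustration, not part of the dataset.

```python
import csv
import io
import uuid

# Sample rows built from the example values in the schema tables.
# The UUIDs are derived as UUIDv5 values from title/year and album/track/title;
# the URL columns are left empty on purpose.
album_id = uuid.uuid5(uuid.NAMESPACE_DNS, "sholay_1975")
song_id = uuid.uuid5(uuid.NAMESPACE_DNS, f"{album_id}_1_mehbooba mehbooba")

ALBUMS_CSV = (
    "album_uuid,album_title,album_year,music_director,label\n"
    f"{album_id},Sholay,1975,R.D. Burman,Polydor\n"
)
SONGS_CSV = (
    "song_uuid,album_uuid,track_number,song_title,singers,youtube_url,yt_music_urls\n"
    f"{song_id},{album_id},1,Mehbooba Mehbooba,R.D. Burman,,\n"
)


def join_songs_to_albums(albums_file, songs_file):
    """Attach album fields to each song row via the album_uuid foreign key."""
    albums = {row["album_uuid"]: row for row in csv.DictReader(albums_file)}
    joined = []
    for song in csv.DictReader(songs_file):
        album = albums.get(song["album_uuid"])
        if album is None:
            continue  # orphan song: no matching album row
        joined.append({
            **song,
            "album_title": album["album_title"],
            "album_year": int(album["album_year"]),
            "music_director": album["music_director"],
            # singers is a comma-separated string; split it into a list
            "singers": [s.strip() for s in song["singers"].split(",") if s.strip()],
        })
    return joined


rows = join_songs_to_albums(io.StringIO(ALBUMS_CSV), io.StringIO(SONGS_CSV))
print(rows[0]["song_title"], "-", rows[0]["album_title"], rows[0]["album_year"])
```

With the real files, the same function can be given open file handles instead of `io.StringIO` objects.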

    data/
    ├── raw/
    │   ├── albums/
    │   │   ├── albums_1931_1944.csv
    │   │   ├── albums_1945_1954.csv
    │   │   ├── ...
    │   │   └── albums_2015_2025.csv
    │   └── songs/
    │       ├── songs_1931_1944.csv
    │       ├── songs_1945_1954_completed.csv
    │       ├── ...
    │       └── songs_2015_2025.csv
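
Because the data is split into one CSV per period, a loader usually needs to read a whole folder at once. The sketch below is one possible way to do that with the standard library. The function name `load_period_files` is hypothetical, and UTF-8 encoding with a header row in every file is an assumption.

```python
import csv
from pathlib import Path


def load_period_files(folder):
    """Read every CSV in `folder` (sorted by file name) into one list of dict rows."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" is what the csv module recommends for file objects
        with path.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

Calling `load_period_files("data/raw/albums")` and `load_period_files("data/raw/songs")` from the repository root would then return all album and song rows in chronological file order, since the period is encoded in each file name.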
| Period | Albums |
|---|---|
| 1931–1944 | 387 |
| 1945–1954 | 1,066 |
| 1955–1964 | 1,058 |
| 1965–1974 | 1,079 |
| 1975–1984 | 1,173 |
| 1985–1994 | 1,245 |
| 1995–2004 | 1,127 |
| 2005–2014 | 1,551 |
| 2015–2025 | 3,987 |
| Period | Songs |
|---|---|
| 1931–1944 | 1,606 |
| 1945–1954 | 6,507 |
| 1955–1964 | 6,757 |
| 1965–1974 | 5,411 |
| 1975–1984 | 5,950 |
| 1985–1994 | 6,862 |
| 1995–2004 | 6,248 |
| 2005–2014 | 8,191 |
| 2015–2025 | 9,473 |
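
As a quick sanity check on the two tables above, the following sketch sums the per-period counts and derives the average number of songs per album for each period. The numbers are copied from the tables; the helper name `songs_per_album` is hypothetical.

```python
# Per-period counts copied from the two tables above.
ALBUMS = {
    "1931–1944": 387, "1945–1954": 1066, "1955–1964": 1058,
    "1965–1974": 1079, "1975–1984": 1173, "1985–1994": 1245,
    "1995–2004": 1127, "2005–2014": 1551, "2015–2025": 3987,
}
SONGS = {
    "1931–1944": 1606, "1945–1954": 6507, "1955–1964": 6757,
    "1965–1974": 5411, "1975–1984": 5950, "1985–1994": 6862,
    "1995–2004": 6248, "2005–2014": 8191, "2015–2025": 9473,
}


def songs_per_album(albums, songs):
    """Average number of songs per album for each period."""
    return {period: songs[period] / albums[period] for period in albums}


total_albums = sum(ALBUMS.values())
total_songs = sum(SONGS.values())
print(f"totals: {total_albums:,} albums, {total_songs:,} songs")
for period, ratio in songs_per_album(ALBUMS, SONGS).items():
    print(f"{period}: {ratio:.2f} songs per album")
```

The totals agree with the 12,673 albums and 57,005 songs reported in the summary table.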
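This dataset uses UUID version 5 (name-based, SHA-1 hashed) so that identifiers stay stable and repeatable across re-scrapes and future updates. Two records receive the same UUID only when their normalized identity strings are identical, so collisions between distinct strings are not a practical concern.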
Each album UUID is generated from a normalized, human-readable album identity.
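Inputs: `album_title`, `album_year`

    import uuid

    # Lowercase and strip the combined "title_year" string, then hash it as a UUIDv5 name.
    unique_album_string = f"{album_title}_{album_year}".lower().strip()
    album_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, unique_album_string)

- Same album title + year → same UUID forever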
- Prevents duplicate albums across re-scrapes
- Stable primary key for joins, updates, and merges
- Human-explainable identity source
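
The properties listed above can be checked directly. The sketch below wraps the recipe in a helper (the name `make_album_uuid` is hypothetical) and compares a few inputs. The second year, 2007, is an arbitrary value chosen only for contrast.

```python
import uuid


def make_album_uuid(album_title, album_year):
    """Apply the README's recipe: lowercase + strip the 'title_year' string, then UUIDv5."""
    key = f"{album_title}_{album_year}".lower().strip()
    return uuid.uuid5(uuid.NAMESPACE_DNS, key)


a = make_album_uuid("Sholay", 1975)
b = make_album_uuid("  SHOLAY", "1975 ")  # case and outer whitespace differ
c = make_album_uuid("Sholay", 2007)       # arbitrary different year, for contrast

print(a == b, a == c, a.version)  # True False 5
```

Note that `.strip()` runs on the combined string, so only whitespace at the very start and end is removed. A trailing space inside the title (for example `"Sholay "`) would sit before the underscore and produce a different UUID, so titles should be normalized before they reach this step.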
Each song UUID is generated relative to its parent album, ensuring correct hierarchy and uniqueness.
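Inputs: `album_uuid` (parent), `track_number`, `song_title`

    import uuid

    # Combine the parent album UUID, track number, and title, then hash as a UUIDv5 name.
    unique_song_string = f"{album_uuid}_{track_number}_{song_title}".lower().strip()
    song_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, unique_song_string)

- Same album + track + title → same song UUID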
- Allows identical song titles across different albums
- Enforces album → song referential integrity
- Ideal for relational databases and graph models
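
Since the identifiers are deterministic, a stored `song_uuid` can in principle be recomputed from the other columns. The sketch below shows one way to do that and also demonstrates that the same title in two albums yields two different song UUIDs. The helper names and the two example albums are hypothetical, and the check assumes the CSV stores the title exactly as it was hashed.

```python
import uuid


def make_song_uuid(album_uuid, track_number, song_title):
    """Recompute a song UUID from its parent album UUID, track number, and title."""
    key = f"{album_uuid}_{track_number}_{song_title}".lower().strip()
    return uuid.uuid5(uuid.NAMESPACE_DNS, key)


def row_matches_recipe(row):
    """Compare a songs-CSV row (as a dict of strings) with the recomputed UUID."""
    expected = make_song_uuid(row["album_uuid"], row["track_number"], row["song_title"])
    return str(expected) == row["song_uuid"].strip().lower()


# Two hypothetical albums that both contain a track called "Title Track".
album_a = uuid.uuid5(uuid.NAMESPACE_DNS, "example film a_2000")
album_b = uuid.uuid5(uuid.NAMESPACE_DNS, "example film b_2001")
song_a = make_song_uuid(album_a, 1, "Title Track")
song_b = make_song_uuid(album_b, 1, "Title Track")

row = {"song_uuid": str(song_a), "album_uuid": str(album_a),
       "track_number": "1", "song_title": "Title Track"}
print(song_a != song_b, row_matches_recipe(row))  # True True
```

A mismatch on real data would not necessarily indicate corruption; it may only mean the title was normalized differently before hashing.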

    Album (album_uuid)
    ├── album_title
    ├── album_year
    ├── music_director
    ├── label
    └── Song (song_uuid)
        ├── track_number
        ├── song_title
        ├── singers
        └── youtube_url
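
The hierarchy above maps naturally onto a nested structure in code. As a rough sketch, the function below groups song rows under their parent album and collects any songs whose `album_uuid` has no matching album. The function name `nest_songs` and the placeholder identifiers in the sample rows are hypothetical.

```python
def nest_songs(albums, songs):
    """Group song rows under their album; return (tree, orphans)."""
    tree = {album["album_uuid"]: {**album, "songs": []} for album in albums}
    orphans = []
    for song in songs:
        parent = tree.get(song["album_uuid"])
        if parent is None:
            orphans.append(song)  # foreign key points at no known album
        else:
            parent["songs"].append(song)
    for album in tree.values():
        # track_number arrives as text when read from CSV, so cast before sorting
        album["songs"].sort(key=lambda s: int(s["track_number"]))
    return tree, orphans


# Placeholder rows; real rows would carry UUID strings in the *_uuid columns.
albums = [{"album_uuid": "album-1", "album_title": "Sholay", "album_year": "1975"}]
songs = [
    {"song_uuid": "song-2", "album_uuid": "album-1", "track_number": "2", "song_title": "B"},
    {"song_uuid": "song-1", "album_uuid": "album-1", "track_number": "1", "song_title": "A"},
    {"song_uuid": "song-3", "album_uuid": "album-9", "track_number": "1", "song_title": "C"},
]
tree, orphans = nest_songs(albums, songs)
print([s["song_title"] for s in tree["album-1"]["songs"]], len(orphans))
```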
- 🎼 Music recommendation engines
- 📈 Trend analysis across decades
- 🎤 Singer & composer network graphs
- 🤖 ML training datasets
- 🗃️ Music archival systems
- Flat CSV files
- Stable UUID primary keys
- Clear schema
- Deterministic regeneration
- Version-friendly structure
Perfect for:
- Kaggle Datasets
- HuggingFace `datasets`
- BigQuery / DuckDB / Postgres imports
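
To give a feel for a relational import, here is a small sketch that uses SQLite from Python's standard library as a stand-in for the databases listed above. The table definitions mirror the documented columns, but the SQL types, the `NOT NULL` constraints, and the placeholder identifiers are assumptions; a real import into DuckDB, Postgres, or BigQuery would use that system's own loader and types.

```python
import sqlite3

SCHEMA = """
CREATE TABLE albums (
    album_uuid     TEXT PRIMARY KEY,
    album_title    TEXT NOT NULL,
    album_year     INTEGER,
    music_director TEXT,
    label          TEXT
);
CREATE TABLE songs (
    song_uuid     TEXT PRIMARY KEY,
    album_uuid    TEXT NOT NULL REFERENCES albums(album_uuid),
    track_number  INTEGER,
    song_title    TEXT,
    singers       TEXT,
    youtube_url   TEXT,
    yt_music_urls TEXT
);
"""


def create_database(path=":memory:"):
    """Create both tables with foreign-key enforcement switched on."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
conn.execute("INSERT INTO albums VALUES (?, ?, ?, ?, ?)",
             ("album-1", "Sholay", 1975, "R.D. Burman", "Polydor"))
conn.execute("INSERT INTO songs VALUES (?, ?, ?, ?, ?, ?, ?)",
             ("song-1", "album-1", 1, "Mehbooba Mehbooba", "R.D. Burman", None, None))

try:
    # This song points at an album that does not exist, so it should be rejected.
    conn.execute("INSERT INTO songs VALUES (?, ?, ?, ?, ?, ?, ?)",
                 ("song-2", "missing-album", 1, "Orphan", "", None, None))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print("orphan rejected:", orphan_rejected)
```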
If you use this dataset in research, projects, or publications, please cite it as:

    @dataset{bollywood_metadata_1931_2025,
      title  = {Bollywood Songs and Albums Metadata Dataset (1931--2025)},
      author = {Asacker},
      year   = {2026},
      url    = {https://github.com/MrAsacker/bollywood-songs-metadata},
      note   = {12,673 albums and 57,005 songs with deterministic UUIDv5 identifiers}
    }

This project and dataset are released under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the dataset and code, subject to the terms of the MIT License.
See the LICENSE file for full details.