Universal audio embeddings for tagging and similarity search, with a clean demo (Streamlit) and an API (FastAPI).
Runs on a single GPU (e.g., RTX 2070) or even on CPU, since embeddings are cached. Suitable for SFX, ambient recordings, and short music clips.
- Practical: find similar sounds, auto-tag clips, export results.
- Reproducible: deterministic seeds, fixed resampling (16 kHz), pinned deps.
- Lightweight: inference-only feature extractors (PANNs/YAMNet/OpenL3), small MLP/LogReg head.
- Explainable: Grad-CAM heatmaps on mel-spectrograms (when using a conv backbone).
- Universal audio embeddings (PANNs CNN14 by default; YAMNet optional)
- Tagging (multi-label) on ESC-50 / UrbanSound8K subsets (or your data)
- Similarity search with FAISS (Flat/IVF-PQ) — Qdrant optional
- Augmentations: random gain/shift, time/freq masking (SpecAugment), mixup
- Metrics: macro-F1, mAP; Recall@K / nDCG@K for search
- Interpretability: Grad-CAM overlays on mel-spectrograms
- UI: Streamlit demo — upload WAV/MP3 → tags, top‑K similar clips, heatmap
- API: FastAPI endpoints `/embed`, `/tag`, `/search`
- MLOps: MLflow tracking & artifacts, Dockerized API + UI
- Repository name: `audio-similarity-tagging-hub`
- Short description: Universal audio embeddings + tagging + similarity search with Streamlit demo and FastAPI; PANNs/YAMNet, FAISS/Qdrant, Grad-CAM.
- Topics/Tags: `audio, machine-learning, deep-learning, pytorch, torchaudio, embeddings, faiss, qdrant, streamlit, fastapi, mlflow, grad-cam, esc50, urbansound8k`
Audio (wav/mp3)
└─ resample 16 kHz mono
└─ log-mel spectrogram (80–128 bins)
└─ Embedding extractor (PANNs/YAMNet/OpenL3) → fixed-size vector
├─ Classifier head (MLP/LogReg) → multi-label tags
└─ Vector index (FAISS/Qdrant) → top‑K neighbors
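
Before any model sees the audio, each clip is resampled to 16 kHz mono and converted to a log-mel spectrogram. A minimal preprocessing sketch with torchaudio (the `n_fft`/`hop_length` values and the helper name are illustrative; the repo's actual entry point is `src/infer/pipeline.py`):

```python
import torch
import torchaudio

SR, N_MELS = 16000, 96  # fixed resampling target and mel bins (80-128 supported)

def load_logmel(path: str) -> torch.Tensor:
    """Load audio, resample to 16 kHz mono, return a log-mel spectrogram."""
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                   # collapse to mono
    if sr != SR:
        wav = torchaudio.functional.resample(wav, sr, SR)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SR, n_fft=1024, hop_length=320, n_mels=N_MELS
    )(wav)
    return torchaudio.transforms.AmplitudeToDB()(mel)     # (1, n_mels, frames)

# The embedding extractor (e.g., PANNs CNN14) then maps the clip to a
# fixed-size vector that feeds both the classifier head and the vector index.
```
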
- ESC‑50 (50 classes, 2k clips, 5 folds) — environmental sounds
- UrbanSound8K (10 classes, 8k clips) — urban events/noises
- Optional: FSD50K subset, GTZAN (music), SpeechCommands (keywords)
You can also drop your own `.wav`/`.mp3` files into `data/custom/` and use the same pipeline.
.
├─ api/
│ ├─ main.py # FastAPI endpoints: /embed, /tag, /search
│ ├─ models.py # pydantic schemas
│ └─ utils.py # audio IO, preproc, postproc
├─ ui/
│ └─ app.py # Streamlit demo
├─ src/
│ ├─ data/
│ │ ├─ prepare_esc50.py
│ │ └─ prepare_us8k.py
│ ├─ features/
│ │ ├─ embedder_panns.py # CNN14 (PANNs) – default
│ │ └─ embedder_yamnet.py # optional
│ ├─ models/
│ │ ├─ classifier.py # MLP/LogReg multi-label head
│ │ └─ grad_cam.py # Grad-CAM on spectrograms
│ ├─ index/
│ │ └─ faiss_index.py
│ ├─ training/
│ │ ├─ train_classifier.py
│ │ └─ eval_metrics.py
│ └─ infer/
│ └─ pipeline.py # end-to-end: audio → tags/similar
├─ notebooks/ # EDA / experiments (optional)
├─ artifacts/ # cached embeddings, trained heads, FAISS index
├─ configs/
│ ├─ config.yaml # paths, model, params
│ └─ classes_esc50.yaml # label mapping
├─ tests/ # unit tests (basic IO/shape/metrics)
├─ docker/
│ ├─ Dockerfile.api
│ ├─ Dockerfile.ui
│ └─ docker-compose.yml
├─ requirements.txt
├─ environment.yml
├─ .gitignore
├─ LICENSE
└─ README.md
# Option A: conda
conda env create -f environment.yml
conda activate audiohub
# Option B: venv
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt

# ESC-50
python -m src.data.prepare_esc50 --root data/esc50 --download
# UrbanSound8K
python -m src.data.prepare_us8k --root data/us8k --download

python -m src.infer.pipeline \
--mode embed \
--input_dir data/esc50/audio \
--out artifacts/embeddings/esc50_panns.npy \
--backend panns --sr 16000 --n_mels 96

python -m src.training.train_classifier \
--embeddings artifacts/embeddings/esc50_panns.npy \
--labels data/esc50/labels.csv \
--model_out artifacts/models/mlp_esc50.pt \
--mlflow_run_name "mlp_esc50_panns"
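
The head trained by this command is intentionally small. A rough sketch of such a multi-label MLP (embedding dimension, hidden size, and dropout are assumptions, not necessarily what the repo's `classifier.py` uses):

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Small multi-label head on top of frozen audio embeddings."""
    def __init__(self, emb_dim: int = 2048, n_classes: int = 50, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),   # raw logits, one per tag
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Multi-label training uses BCEWithLogitsLoss over the tag vector:
# loss = nn.BCEWithLogitsLoss()(head(embeddings), multi_hot_targets.float())
```
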
python -m src.index.faiss_index \
--embeddings artifacts/embeddings/esc50_panns.npy \
--index_out artifacts/index/faiss_esc50.idx \
--type flat --topk 10
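
For the `flat` index type, building and querying comes down to a few FAISS calls. A sketch (the real `faiss_index.py` also covers IVF-PQ):

```python
import numpy as np
import faiss

emb = np.load("artifacts/embeddings/esc50_panns.npy").astype("float32")
faiss.normalize_L2(emb)                    # cosine similarity via inner product
index = faiss.IndexFlatIP(emb.shape[1])    # exact (flat) index
index.add(emb)

query = emb[:1]                            # any (1, dim) float32 vector
scores, ids = index.search(query, 10)      # top-10 neighbors (ids + similarities)
faiss.write_index(index, "artifacts/index/faiss_esc50.idx")
```
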
# API
uvicorn api.main:app --reload --port 8000
# UI (in another terminal)
streamlit run ui/app.py

Now open Streamlit and try:
- Upload a `.wav`/`.mp3` file.
- See predicted tags (top‑K with scores), top‑K similar clips, and a Grad‑CAM heatmap.
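
The Grad‑CAM heatmap is computed from the convolutional backbone's activations and gradients. A minimal sketch of the idea, using hooks on an arbitrary conv layer (illustrative only; the repo's `src/models/grad_cam.py` may differ in details):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, mel_batch, class_idx):
    """Return a (B, F, T) Grad-CAM heatmap over log-mel spectrogram inputs."""
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output

    def bwd_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0]

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(mel_batch)                  # (B, num_classes)
        model.zero_grad()
        logits[:, class_idx].sum().backward()
        acts = activations["value"]                # (B, C, F, T)
        grads = gradients["value"]                 # (B, C, F, T)
        weights = grads.mean(dim=(2, 3), keepdim=True)   # GAP over freq/time
        cam = F.relu((weights * acts).sum(dim=1))        # weighted sum over channels
        cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    finally:
        h1.remove()
        h2.remove()
    return cam  # upsample to the spectrogram size before overlaying
```
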
- `POST /embed` — returns embedding for uploaded audio
- `POST /tag` — returns multi-label tag probabilities
- `POST /search` — returns top‑K nearest neighbors (ids + distances)
Payloads are documented via the built-in Swagger UI at `/docs`.
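
A quick way to exercise the endpoints from Python (the multipart field name `file`, the example file path, and the response shapes are assumptions; check `/docs` for the actual schemas):

```python
import requests

BASE = "http://localhost:8000"
AUDIO = "data/custom/example.wav"  # any local clip

with open(AUDIO, "rb") as f:
    tags = requests.post(f"{BASE}/tag", files={"file": f}).json()
print(tags)   # per-tag probabilities (exact JSON shape per /docs)

with open(AUDIO, "rb") as f:
    hits = requests.post(f"{BASE}/search", files={"file": f}).json()
print(hits)   # ids + distances of the top-K neighbors
```
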
- Classification: macro‑F1, mAP, confusion matrix per dataset fold
- Similarity: Recall@K, nDCG@K (using class labels or curated pairs)
- Ablations: PANNs vs YAMNet; with/without SpecAugment; index types
Run:
python -m src.training.eval_metrics --help
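
As a reference for what Recall@K means here: a query counts as a hit if at least one of its top‑K neighbors shares the query's label. A minimal NumPy sketch (function name and array shapes are assumptions):

```python
import numpy as np

def recall_at_k(neighbor_ids: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """neighbor_ids: (n_queries, >=k) indices returned by the index;
    labels: (n_items,) integer class labels, where query i has label labels[i]."""
    hits = 0
    for i, nn_ids in enumerate(neighbor_ids[:, :k]):
        nn_ids = nn_ids[nn_ids != i]            # drop the query itself if present
        hits += int(np.any(labels[nn_ids] == labels[i]))
    return hits / len(neighbor_ids)
```
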
- Fixed `RANDOM_SEED=42` across PyTorch/NumPy/`torch.backends.cudnn`.
- Deterministic resampling to 16 kHz mono; fixed mel config.
- Embedding cache versioned under `artifacts/` with MLflow runs.
- Model cards + data cards templates under `docs/` (optional).
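
A seeding helper consistent with the bullets above might look like this (the function name is illustrative):

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix RANDOM_SEED across Python, NumPy, and PyTorch/cuDNN."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)
```
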
# Build
docker compose -f docker/docker-compose.yml build
# Run
docker compose -f docker/docker-compose.yml up

This starts two containers: `api` (FastAPI) and `ui` (Streamlit), sharing the same `artifacts` volume.
All core params live in `configs/config.yaml`:
- paths: data roots, artifacts, model/index outputs
- audio: sr, n_mels, hop_length, win_length
- model: backend (`panns`, `yamnet`), classifier head params
- index: `faiss` type, top‑K
- eval: metrics, folds
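
Scripts can read these settings with PyYAML; the key names below mirror the bullets above, but the exact layout of `config.yaml` may differ:

```python
import yaml

with open("configs/config.yaml") as f:
    cfg = yaml.safe_load(f)

sr = cfg["audio"]["sr"]            # e.g. 16000
n_mels = cfg["audio"]["n_mels"]    # e.g. 96
backend = cfg["model"]["backend"]  # "panns" or "yamnet"
top_k = cfg["index"]["topk"]       # neighbors returned by /search
```
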
- Add Qdrant backend in parallel to FAISS
- Export to ONNX/TorchScript for edge inference
- Add OpenL3/CLAP as optional embedders
- Batch tagging/search API with rate limiting
- Unit tests for Grad‑CAM correctness (saliency sanity check)
- Example notebook with qualitative analysis
MIT — see LICENSE. Please check dataset licenses separately.
- PANNs: Kong et al., PANNs: Large-Scale Pretrained Audio Neural Networks.
- YAMNet: Google Research (MobileNetV1-based audio event classifier).
- FAISS: Facebook AI Similarity Search.