Audio Similarity & Tagging Hub


Universal audio embeddings, tagging, and similarity search with a clean demo (Streamlit) and an API (FastAPI).
Works on a single GPU (e.g., an RTX 2070) or on CPU by caching embeddings. Suitable for SFX, ambient recordings, and short music clips.


Why this repo

  • Practical: find similar sounds, auto-tag clips, export results.
  • Reproducible: deterministic seeds, fixed resampling (16 kHz), pinned deps.
  • Lightweight: inference-only feature extractors (PANNs/YAMNet/OpenL3), small MLP/LogReg head.
  • Explainable: Grad-CAM heatmaps on mel-spectrograms (when using a conv backbone).

Features

  • Universal audio embeddings (PANNs CNN14 by default; YAMNet optional)
  • Tagging (multi-label) on ESC-50 / UrbanSound8K subsets (or your data)
  • Similarity search with FAISS (Flat/IVF-PQ) — Qdrant optional
  • Augmentations: random gain/shift, time/freq masking (SpecAugment), mixup
  • Metrics: macro-F1, mAP; Recall@K / nDCG@K for search
  • Interpretability: Grad-CAM overlays on mel-spectrograms
  • UI: Streamlit demo — upload WAV/MP3 → tags, top‑K similar clips, heatmap
  • API: FastAPI endpoints /embed, /tag, /search
  • MLOps: MLflow tracking & artifacts, Dockerized API+UI

Repo name, description & tags

  • Repository name: audio-similarity-tagging-hub
  • Short description: Universal audio embeddings + tagging + similarity search with Streamlit demo and FastAPI; PANNs/YAMNet, FAISS/Qdrant, Grad-CAM.
  • Topics/Tags: audio, machine-learning, deep-learning, pytorch, torchaudio, embeddings, faiss, qdrant, streamlit, fastapi, mlflow, grad-cam, esc50, urbansound8k

Architecture

Audio (wav/mp3)
   └─ resample 16 kHz mono
        └─ log-mel spectrogram (80–128 bins)
             └─ Embedding extractor (PANNs/YAMNet/OpenL3)  →  fixed-size vector
                  ├─ Classifier head (MLP/LogReg) → multi-label tags
                  └─ Vector index (FAISS/Qdrant) → top‑K neighbors
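
For illustration, a minimal preprocessing sketch with torchaudio (the mel parameters shown are assumptions; the actual values live in configs/config.yaml):

import torch
import torchaudio

def log_mel(path: str, sr: int = 16000, n_mels: int = 96) -> torch.Tensor:
    """Load audio, downmix to mono, resample to 16 kHz, return a log-mel spectrogram."""
    wav, orig_sr = torchaudio.load(path)                 # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                  # mono
    if orig_sr != sr:
        wav = torchaudio.transforms.Resample(orig_sr, sr)(wav)
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=n_mels)(wav)
    return torchaudio.transforms.AmplitudeToDB()(mel)    # log scale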

Datasets

  • ESC‑50 (50 classes, 2k clips, 5 folds) — environmental sounds
  • UrbanSound8K (10 classes, 8k clips) — urban events/noises
  • Optional: FSD50K subset, GTZAN (music), SpeechCommands (keywords)

You can also drop your own .wav/.mp3 into data/custom/ and use the same pipeline.


Project structure

.
├─ api/
│  ├─ main.py                # FastAPI endpoints: /embed, /tag, /search
│  ├─ models.py              # pydantic schemas
│  └─ utils.py               # audio IO, preproc, postproc
├─ ui/
│  └─ app.py                 # Streamlit demo
├─ src/
│  ├─ data/
│  │  ├─ prepare_esc50.py
│  │  └─ prepare_us8k.py
│  ├─ features/
│  │  ├─ embedder_panns.py   # CNN14 (PANNs) – default
│  │  └─ embedder_yamnet.py  # optional
│  ├─ models/
│  │  ├─ classifier.py       # MLP/LogReg multi-label head
│  │  └─ grad_cam.py         # Grad-CAM on spectrograms
│  ├─ index/
│  │  └─ faiss_index.py
│  ├─ training/
│  │  ├─ train_classifier.py
│  │  └─ eval_metrics.py
│  └─ infer/
│     └─ pipeline.py         # end-to-end: audio → tags/similar
├─ notebooks/                # EDA / experiments (optional)
├─ artifacts/                # cached embeddings, trained heads, FAISS index
├─ configs/
│  ├─ config.yaml            # paths, model, params
│  └─ classes_esc50.yaml     # label mapping
├─ tests/                    # unit tests (basic IO/shape/metrics)
├─ docker/
│  ├─ Dockerfile.api
│  ├─ Dockerfile.ui
│  └─ docker-compose.yml
├─ requirements.txt
├─ environment.yml
├─ .gitignore
├─ LICENSE
└─ README.md

Quickstart

1) Setup environment

# Option A: conda
conda env create -f environment.yml
conda activate audiohub

# Option B: venv
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -r requirements.txt

2) Download data (examples)

# ESC-50
python -m src.data.prepare_esc50 --root data/esc50 --download

# UrbanSound8K
python -m src.data.prepare_us8k --root data/us8k --download

3) Extract embeddings (cache)

python -m src.infer.pipeline \
  --mode embed \
  --input_dir data/esc50/audio \
  --out artifacts/embeddings/esc50_panns.npy \
  --backend panns --sr 16000 --n_mels 96

4) Train a lightweight classifier

python -m src.training.train_classifier \
  --embeddings artifacts/embeddings/esc50_panns.npy \
  --labels data/esc50/labels.csv \
  --model_out artifacts/models/mlp_esc50.pt \
  --mlflow_run_name "mlp_esc50_panns"
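
The head itself is deliberately small. A hypothetical sketch of the layout in src/models/classifier.py (the 2048-d input matches PANNs CNN14 embeddings; the exact sizes here are assumptions):

import torch.nn as nn

class MLPHead(nn.Module):
    """Tiny multi-label head over cached embeddings (hypothetical layout)."""
    def __init__(self, emb_dim: int = 2048, n_classes: int = 50, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),   # raw logits; pair with nn.BCEWithLogitsLoss
        )

    def forward(self, x):
        return self.net(x)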

5) Build the FAISS index

python -m src.index.faiss_index \
  --embeddings artifacts/embeddings/esc50_panns.npy \
  --index_out artifacts/index/faiss_esc50.idx \
  --type flat --topk 10
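
Under the hood, a flat index is only a few lines of FAISS. A minimal sketch, assuming embeddings are L2-normalized so inner product equals cosine similarity:

import numpy as np
import faiss

emb = np.load("artifacts/embeddings/esc50_panns.npy").astype("float32")
faiss.normalize_L2(emb)                    # in-place; makes inner product = cosine
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)
scores, ids = index.search(emb[:1], 10)    # top-10 neighbors of the first clip
faiss.write_index(index, "artifacts/index/faiss_esc50.idx")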

6) Run API and UI

# API
uvicorn api.main:app --reload --port 8000

# UI (in another terminal)
streamlit run ui/app.py

Now open Streamlit and try:

  • Upload .wav/.mp3
  • See predicted tags (top‑K with scores), top‑K similar clips, and a Grad‑CAM heatmap.

FastAPI endpoints

  • POST /embed — returns embedding for uploaded audio
  • POST /tag — returns multi-label tag probabilities
  • POST /search — returns top‑K nearest neighbors (ids + distances)

Request and response payloads are documented via the built-in Swagger UI at /docs.
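
For example, tagging a clip from Python (the multipart field name "file" is an assumption; check /docs for the actual schema):

import requests

with open("clip.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/tag", files={"file": f})
print(resp.json())   # tag probabilities; exact shape depends on the API schema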


Evaluation

  • Classification: macro‑F1, mAP, confusion matrix per dataset fold
  • Similarity: Recall@K, nDCG@K (using class labels or curated pairs)
  • Ablations: PANNs vs YAMNet; with/without SpecAugment; index types

Run:

python -m src.training.eval_metrics --help
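
Recall@K on the search side is simple to compute when class labels define relevance. A minimal sketch (label-based relevance is an assumption about eval_metrics.py):

import numpy as np

def recall_at_k(neighbor_labels: np.ndarray, query_labels: np.ndarray, k: int = 10) -> float:
    """Fraction of queries with at least one same-class hit among their top-k neighbors.

    neighbor_labels: (n_queries, >=k) labels of retrieved clips, nearest first.
    query_labels:    (n_queries,) true label of each query clip.
    """
    hits = (neighbor_labels[:, :k] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())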

Reproducibility

  • Fixed RANDOM_SEED=42 for Python, NumPy, and PyTorch RNGs; torch.backends.cudnn set to deterministic (see the sketch below).
  • Deterministic resampling to 16 kHz mono; fixed mel config.
  • Embedding cache versioned under artifacts/ with MLflow runs.
  • Model card and data card templates under docs/ (optional).
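
A typical seed-fixing helper along these lines (a sketch, not necessarily the exact code in this repo):

import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Pin all RNGs and force deterministic cuDNN kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False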

Docker

# Build
docker compose -f docker/docker-compose.yml build

# Run
docker compose -f docker/docker-compose.yml up

This starts two containers: api (FastAPI) and ui (Streamlit), sharing the same artifacts volume.


Configuration

All core parameters live in configs/config.yaml (a hypothetical example follows this list):

  • paths: data roots, artifacts, model/index outputs
  • audio: sr, n_mels, hop_length, win_length
  • model: backend (panns, yamnet), classifier head params
  • index: faiss type, top‑K
  • eval: metrics, folds
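
A hypothetical config.yaml sketch (sr, n_mels, top-K, and fold count mirror the values used above; hop/win lengths are placeholder guesses):

paths:
  data_root: data/
  artifacts: artifacts/
audio:
  sr: 16000
  n_mels: 96
  hop_length: 160
  win_length: 400
model:
  backend: panns        # or yamnet
  head: mlp             # or logreg
index:
  type: flat            # or ivf_pq
  topk: 10
eval:
  metrics: [macro_f1, map, recall_at_k, ndcg_at_k]
  folds: 5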

Roadmap

  • Add Qdrant backend in parallel to FAISS
  • Export to ONNX/TorchScript for edge inference
  • Add OpenL3/CLAP as optional embedders
  • Batch tagging/search API with rate limiting
  • Unit tests for Grad‑CAM correctness (saliency sanity check)
  • Example notebook with qualitative analysis

License

MIT — see LICENSE. Please check dataset licenses separately.


Acknowledgements

  • PANNs: Kong et al., PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition (2020).
  • YAMNet: Google Research (MobileNetV1-based audio event classifier).
  • FAISS: Facebook AI Similarity Search.
