Universal audio embeddings for tagging and similarity search, with a clean demo (Streamlit) and an API (FastAPI).
Runs on a single GPU (e.g., RTX 2070) or even on CPU, since embeddings are cached. Suitable for SFX, ambient recordings, and short music clips.
- Practical: find similar sounds, auto-tag clips, export results.
- Reproducible: deterministic seeds, fixed resampling (16 kHz), pinned deps.
- Lightweight: inference-only feature extractors (PANNs/YAMNet/OpenL3), small MLP/LogReg head.
- Explainable: Grad-CAM heatmaps on mel-spectrograms (when using a conv backbone).
- Universal audio embeddings (PANNs CNN14 by default; YAMNet optional)
- Tagging (multi-label) on ESC-50 / UrbanSound8K subsets (or your data)
- Similarity search with FAISS (Flat/IVF-PQ) — Qdrant optional
- Augmentations: random gain/shift, time/freq masking (SpecAugment), mixup
- Metrics: macro-F1, mAP; Recall@K / nDCG@K for search
- Interpretability: Grad-CAM overlays on mel-spectrograms
- UI: Streamlit demo — upload WAV/MP3 → tags, top‑K similar clips, heatmap
- API: FastAPI endpoints `/embed`, `/tag`, `/search`
- MLOps: MLflow tracking & artifacts, Dockerized API + UI
- Repository name: `audio-similarity-tagging-hub`
- Short description: Universal audio embeddings + tagging + similarity search with Streamlit demo and FastAPI; PANNs/YAMNet, FAISS/Qdrant, Grad-CAM.
- Topics/Tags: `audio, machine-learning, deep-learning, pytorch, torchaudio, embeddings, faiss, qdrant, streamlit, fastapi, mlflow, grad-cam, esc50, urbansound8k`
Audio (wav/mp3)
└─ resample 16 kHz mono
└─ log-mel spectrogram (80–128 bins)
└─ Embedding extractor (PANNs/YAMNet/OpenL3) → fixed-size vector
├─ Classifier head (MLP/LogReg) → multi-label tags
└─ Vector index (FAISS/Qdrant) → top‑K neighbors
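
Before any model sees the audio, each clip is resampled to 16 kHz mono and converted to a log-mel spectrogram. A minimal preprocessing sketch with torchaudio (the `n_fft`/`hop_length` values and the helper name are illustrative; the repo's actual entry point is `src/infer/pipeline.py`):

```python
import torch
import torchaudio

SR, N_MELS = 16000, 96  # fixed resampling target and mel bins (80-128 supported)

def load_logmel(path: str) -> torch.Tensor:
    """Load audio, resample to 16 kHz mono, return a log-mel spectrogram."""
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                   # collapse to mono
    if sr != SR:
        wav = torchaudio.functional.resample(wav, sr, SR)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SR, n_fft=1024, hop_length=320, n_mels=N_MELS
    )(wav)
    return torchaudio.transforms.AmplitudeToDB()(mel)     # (1, n_mels, frames)

# The embedding extractor (e.g., PANNs CNN14) then maps the clip to a
# fixed-size vector that feeds both the classifier head and the vector index.
```
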
- ESC‑50 (50 classes, 2k clips, 5 folds) — environmental sounds
- UrbanSound8K (10 classes, 8k clips) — urban events/noises
- Optional: FSD50K subset, GTZAN (music), SpeechCommands (keywords)
You can also drop your own `.wav`/`.mp3` files into `data/custom/` and use the same pipeline.
.
├─ api/
│ ├─ main.py # FastAPI endpoints: /embed, /tag, /search
│ ├─ models.py # pydantic schemas
│ └─ utils.py # audio IO, preproc, postproc
├─ ui/
│ └─ app.py # Streamlit demo
├─ src/
│ ├─ data/
│ │ ├─ prepare_esc50.py
│ │ └─ prepare_us8k.py
│ ├─ features/
│ │ ├─ embedder_panns.py # CNN14 (PANNs) – default
│ │ └─ embedder_yamnet.py # optional
│ ├─ models/
│ │ ├─ classifier.py # MLP/LogReg multi-label head
│ │ └─ grad_cam.py # Grad-CAM on spectrograms
│ ├─ index/
│ │ └─ faiss_index.py
│ ├─ training/
│ │ ├─ train_classifier.py
│ │ └─ eval_metrics.py
│ └─ infer/
│ └─ pipeline.py # end-to-end: audio → tags/similar
├─ notebooks/ # EDA / experiments (optional)
├─ artifacts/ # cached embeddings, trained heads, FAISS index
├─ configs/
│ ├─ config.yaml # paths, model, params
│ └─ classes_esc50.yaml # label mapping
├─ tests/ # unit tests (basic IO/shape/metrics)
├─ docker/
│ ├─ Dockerfile.api
│ ├─ Dockerfile.ui
│ └─ docker-compose.yml
├─ requirements.txt
├─ environment.yml
├─ .gitignore
├─ LICENSE
└─ README.md
# Option A: conda
conda env create -f environment.yml
conda activate audiohub
# Option B: venv
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt

# ESC-50
python -m src.data.prepare_esc50 --root data/esc50 --download
# UrbanSound8K
python -m src.data.prepare_us8k --root data/us8k --download

python -m src.infer.pipeline \
--mode embed \
--input_dir data/esc50/audio \
--out artifacts/embeddings/esc50_panns.npy \
--backend panns --sr 16000 --n_mels 96

python -m src.training.train_classifier \
--embeddings artifacts/embeddings/esc50_panns.npy \
--labels data/esc50/labels.csv \
--model_out artifacts/models/mlp_esc50.pt \
--mlflow_run_name "mlp_esc50_panns"
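
The head trained by this command is intentionally small. A rough sketch of such a multi-label MLP (embedding dimension, hidden size, and dropout are assumptions, not necessarily what the repo's `classifier.py` uses):

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Small multi-label head on top of frozen audio embeddings."""
    def __init__(self, emb_dim: int = 2048, n_classes: int = 50, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),   # raw logits, one per tag
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Multi-label training uses BCEWithLogitsLoss over the tag vector:
# loss = nn.BCEWithLogitsLoss()(head(embeddings), multi_hot_targets.float())
```
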
python -m src.index.faiss_index \
--embeddings artifacts/embeddings/esc50_panns.npy \
--index_out artifacts/index/faiss_esc50.idx \
--type flat --topk 10
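
For the `flat` index type, building and querying comes down to a few FAISS calls. A sketch (the real `faiss_index.py` also covers IVF-PQ):

```python
import numpy as np
import faiss

emb = np.load("artifacts/embeddings/esc50_panns.npy").astype("float32")
faiss.normalize_L2(emb)                    # cosine similarity via inner product
index = faiss.IndexFlatIP(emb.shape[1])    # exact (flat) index
index.add(emb)

query = emb[:1]                            # any (1, dim) float32 vector
scores, ids = index.search(query, 10)      # top-10 neighbors (ids + similarities)
faiss.write_index(index, "artifacts/index/faiss_esc50.idx")
```
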
# API
uvicorn api.main:app --reload --port 8000
# UI (in another terminal)
streamlit run ui/app.py

Now open Streamlit and try:
- Upload a `.wav`/`.mp3` file.
- See predicted tags (top‑K with scores), top‑K similar clips, and a Grad‑CAM heatmap.
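
The Grad‑CAM heatmap is computed from the convolutional backbone's activations and gradients. A minimal sketch of the idea, using hooks on an arbitrary conv layer (illustrative only; the repo's `src/models/grad_cam.py` may differ in details):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, mel_batch, class_idx):
    """Return a (B, F, T) Grad-CAM heatmap over log-mel spectrogram inputs."""
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output

    def bwd_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0]

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(mel_batch)                  # (B, num_classes)
        model.zero_grad()
        logits[:, class_idx].sum().backward()
        acts = activations["value"]                # (B, C, F, T)
        grads = gradients["value"]                 # (B, C, F, T)
        weights = grads.mean(dim=(2, 3), keepdim=True)   # GAP over freq/time
        cam = F.relu((weights * acts).sum(dim=1))        # weighted sum over channels
        cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    finally:
        h1.remove()
        h2.remove()
    return cam  # upsample to the spectrogram size before overlaying
```
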
- `POST /embed` — returns embedding for uploaded audio
- `POST /tag` — returns multi-label tag probabilities
- `POST /search` — returns top‑K nearest neighbors (ids + distances)
Payloads are documented via the built-in Swagger UI at `/docs`.
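
A quick way to exercise the endpoints from Python (the multipart field name `file`, the example file path, and the response shapes are assumptions; check `/docs` for the actual schemas):

```python
import requests

BASE = "http://localhost:8000"
AUDIO = "data/custom/example.wav"  # any local clip

with open(AUDIO, "rb") as f:
    tags = requests.post(f"{BASE}/tag", files={"file": f}).json()
print(tags)   # per-tag probabilities (exact JSON shape per /docs)

with open(AUDIO, "rb") as f:
    hits = requests.post(f"{BASE}/search", files={"file": f}).json()
print(hits)   # ids + distances of the top-K neighbors
```
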
- Classification: macro‑F1, mAP, confusion matrix per dataset fold
- Similarity: Recall@K, nDCG@K (using class labels or curated pairs)
- Ablations: PANNs vs YAMNet; with/without SpecAugment; index types
Run:
python -m src.training.eval_metrics --help
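
As a reference for what Recall@K means here: a query counts as a hit if at least one of its top‑K neighbors shares the query's label. A minimal NumPy sketch (function name and array shapes are assumptions):

```python
import numpy as np

def recall_at_k(neighbor_ids: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """neighbor_ids: (n_queries, >=k) indices returned by the index;
    labels: (n_items,) integer class labels, where query i has label labels[i]."""
    hits = 0
    for i, nn_ids in enumerate(neighbor_ids[:, :k]):
        nn_ids = nn_ids[nn_ids != i]            # drop the query itself if present
        hits += int(np.any(labels[nn_ids] == labels[i]))
    return hits / len(neighbor_ids)
```
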
- Fixed `RANDOM_SEED=42` across PyTorch/NumPy/`torch.backends.cudnn`.
- Deterministic resampling to 16 kHz mono; fixed mel config.
- Embedding cache versioned under `artifacts/` with MLflow runs.
- Model cards + data cards templates under `docs/` (optional).
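
A seeding helper consistent with the bullets above might look like this (the function name is illustrative):

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix RANDOM_SEED across Python, NumPy, and PyTorch/cuDNN."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)
```
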
# Build
docker compose -f docker/docker-compose.yml build
# Run
docker compose -f docker/docker-compose.yml up

This starts two containers: `api` (FastAPI) and `ui` (Streamlit), sharing the same `artifacts` volume.
All core params live in `configs/config.yaml`:
- paths: data roots, artifacts, model/index outputs
- audio: sr, n_mels, hop_length, win_length
- model: backend (`panns`, `yamnet`), classifier head params
- index: `faiss` type, top‑K
- eval: metrics, folds
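
Scripts can read these settings with PyYAML; the key names below mirror the bullets above, but the exact layout of `config.yaml` may differ:

```python
import yaml

with open("configs/config.yaml") as f:
    cfg = yaml.safe_load(f)

sr = cfg["audio"]["sr"]            # e.g. 16000
n_mels = cfg["audio"]["n_mels"]    # e.g. 96
backend = cfg["model"]["backend"]  # "panns" or "yamnet"
top_k = cfg["index"]["topk"]       # neighbors returned by /search
```
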
- Add Qdrant backend in parallel to FAISS
- Export to ONNX/TorchScript for edge inference
- Add OpenL3/CLAP as optional embedders
- Batch tagging/search API with rate limiting
- Unit tests for Grad‑CAM correctness (saliency sanity check)
- Example notebook with qualitative analysis
MIT — see LICENSE. Please check dataset licenses separately.
- PANNs: Kong et al., PANNs: Large-Scale Pretrained Audio Neural Networks.
- YAMNet: Google Research (MobileNetV1-based audio event classifier).
- FAISS: Facebook AI Similarity Search.