
UNet Audio Filter – AI Speech Enhancement

GPU-accelerated speech enhancement using a U-Net model

Overview

This repo provides a complete pipeline for removing background noise from speech using a U-Net model trained on the VoiceBank+DEMAND dataset. It includes a CLI inference tool, a Streamlit web app, a training notebook, and a global path system that "just works" after cloning.

Repository structure

unet-audiofilter/
├── apps/
│   ├── __init__.py
│   ├── requirements.txt
│   └── streamlit_app.py          # Web UI for enhancement
├── config/
│   ├── __init__.py
│   └── paths.py                  # Auto-detected project paths
├── dataset/                      # Sample/test data and scp files
│   ├── clean_testset_wav/
│   ├── clean_trainset_28spk_wav/
│   ├── noisy_testset_wav/
│   ├── noisy_trainset_28spk_wav/
│   ├── test.scp
│   └── train.scp
├── models/                       # Model checkpoints
│   ├── best_gpu_model.pth
│   ├── checkpoint_epoch_5.pth
│   └── checkpoint_epoch_10.pth
├── notebooks/
│   └── main_training.ipynb       # End-to-end training pipeline
├── presentation/                 # Assets for docs / slides
├── results/                      # Training curves, analysis, samples
├── scripts/
│   ├── __init__.py
│   ├── inference.py              # CLI inference entrypoint
│   ├── setup_environment.py      # Validation / .env generator
│   └── train.py                  # Redirects to notebook
├── src/
│   ├── __init__.py
│   ├── audio_utils.py            # ffmpeg-based audio I/O helpers
│   ├── unet_model.py             # U-Net model + loss
│   └── utils.py                  # Plotting, helpers, metrics
├── tools/
│   ├── __init__.py
│   ├── test_quality.py
│   └── test_system.py
├── config.yaml                   # Default config (dataset/model/training)
├── requirements.txt
├── req-aur.txt                   # Arch/Manjaro package hints
├── run_app.sh                    # Start Streamlit app
├── run_inference.sh              # Run CLI inference
├── run_tests.sh                  # Run system + quality tests
├── setup.py                      # Wrapper to scripts/setup_environment.py
├── LICENSE
└── README.md

Installation

Install ffmpeg via your system package manager, then install the Python dependencies:

pip install -r requirements.txt

Quick start

Validate paths (auto-detect root and create missing dirs):

python -c "from config.paths import quick_setup; quick_setup()"

Run tests:

./run_tests.sh

Run inference:

./run_inference.sh input.wav output.wav

Launch the web app:

./run_app.sh

Audio & spectrogram comparison

Spectrograms for a representative sample (p232_010):

Spectrogram comparison – p232_010

Listen to the audio examples:

Noisy vs Enhanced vs Clean:
<p><strong>Noisy input</strong></p>
<audio controls>
	<source src="results/comparison/p232_010_noisy.wav" type="audio/wav" />
	Your browser does not support the audio element.
</audio>

<p><strong>Enhanced output</strong></p>
<audio controls>
	<source src="results/comparison/p232_010_enhanced.wav" type="audio/wav" />
	Your browser does not support the audio element.
</audio>

<p><strong>Clean reference</strong></p>
<audio controls>
	<source src="results/comparison/p232_010_clean.wav" type="audio/wav" />
	Your browser does not support the audio element.
</audio>

Inference (CLI)

The CLI uses a GPU if available and expects a pretrained checkpoint at models/best_gpu_model.pth by default.

python scripts/inference.py noisy.wav enhanced.wav --device auto

Notes:

  • Spectrogram settings: n_fft=1024, hop_length=256.
  • Processing is done in 4s chunks with 25% overlap and crossfade (see the sketch after this list).
  • The shipped scripts/app instantiate UNet with base_filters=32, depth=3, dropout=0.1 to match the provided checkpoint.
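To make these notes concrete, here is a hedged sketch of the pipeline: checkpoint loading with automatic device selection, followed by overlap-and-crossfade chunked processing. The helper names (load_model, enhance_chunk, enhance_long), the 16 kHz sample rate, the checkpoint key layout, and the magnitude-masking scheme are assumptions; the actual scripts/inference.py may differ:

# Hypothetical sketch of the inference pipeline; not the shipped implementation.
import numpy as np
import torch
from src.unet_model import UNet

SR = 16000                 # assumed sample rate for VoiceBank+DEMAND-style data
N_FFT, HOP_LENGTH = 1024, 256
CHUNK = 4 * SR             # 4-second chunks
OVERLAP = CHUNK // 4       # 25% overlap
STEP = CHUNK - OVERLAP

def load_model(path="models/best_gpu_model.pth", device="auto"):
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = UNet(base_filters=32, depth=3, dropout=0.1)  # matches the shipped checkpoint
    state = torch.load(path, map_location=device)
    model.load_state_dict(state.get("model_state_dict", state))  # raw or wrapped state dict
    return model.to(device).eval(), device

def enhance_chunk(model, device, chunk: np.ndarray) -> np.ndarray:
    """STFT -> magnitude mask via U-Net -> iSTFT. The masking scheme is assumed."""
    x = torch.from_numpy(chunk).float().to(device)
    window = torch.hann_window(N_FFT, device=device)
    spec = torch.stft(x, N_FFT, HOP_LENGTH, window=window, return_complex=True)
    mag, phase = spec.abs(), torch.angle(spec)
    with torch.no_grad():
        mask = model(mag.unsqueeze(0).unsqueeze(0)).squeeze()
    est = torch.polar(mag * mask, phase)  # reuse the noisy phase
    y = torch.istft(est, N_FFT, HOP_LENGTH, window=window, length=len(chunk))
    return y.cpu().numpy()

def enhance_long(model, device, audio: np.ndarray) -> np.ndarray:
    out = np.zeros(len(audio), dtype=np.float32)
    weight = np.zeros_like(out)
    fade = np.ones(CHUNK, dtype=np.float32)
    fade[:OVERLAP] = np.linspace(0.0, 1.0, OVERLAP)    # fade-in
    fade[-OVERLAP:] = np.linspace(1.0, 0.0, OVERLAP)   # fade-out
    for start in range(0, len(audio), STEP):
        chunk = audio[start:start + CHUNK]
        n = len(chunk)
        if n < CHUNK:
            chunk = np.pad(chunk, (0, CHUNK - n))      # zero-pad the final chunk
        y = enhance_chunk(model, device, chunk)
        out[start:start + n] += y[:n] * fade[:n]
        weight[start:start + n] += fade[:n]
    return out / np.maximum(weight, 1e-8)              # normalize by accumulated fades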

Streamlit app

Start the app with ./run_app.sh, upload an audio file, and download the enhanced result. The app shows waveform and spectrogram comparisons and basic quality metrics.
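For reference, a stripped-down version of the upload → enhance → download flow; the shipped apps/streamlit_app.py adds the waveform/spectrogram plots and metrics on top, and enhance_file here is a placeholder to wire to the repo's inference code:

# Minimal sketch of the Streamlit flow; not the shipped app.
import streamlit as st

def enhance_file(in_path: str, out_path: str) -> str:
    """Placeholder: wire this to the repo's inference code."""
    import shutil
    shutil.copy(in_path, out_path)  # passthrough stub
    return out_path

st.title("UNet Audio Filter")
uploaded = st.file_uploader("Upload noisy audio", type=["wav", "mp3", "flac"])
if uploaded is not None:
    with open("input.wav", "wb") as f:
        f.write(uploaded.getbuffer())
    out = enhance_file("input.wav", "enhanced.wav")
    st.audio(out)
    with open(out, "rb") as f:
        st.download_button("Download enhanced audio", f, file_name="enhanced.wav")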

Training

Training is provided as a Jupyter notebook for clarity and interactivity:

  1. Open notebooks/main_training.ipynb.
  2. Run cells in sequence; checkpoints are saved under models/.

The helper script scripts/train.py simply points you to the notebook. For retraining, config.yaml provides a reference model configuration (base_filters=64, depth=4, dropout=0.2) that you can adopt, as sketched below.
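A minimal sketch of building the model from that configuration; the exact YAML key names are assumptions based on the values quoted above:

# Sketch: construct the U-Net from config.yaml. The "model" section and its
# key names (base_filters, depth, dropout) are assumed, not verified.
import yaml
from src.unet_model import UNet

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

model_cfg = cfg["model"]  # e.g. {"base_filters": 64, "depth": 4, "dropout": 0.2}
model = UNet(
    base_filters=model_cfg["base_filters"],
    depth=model_cfg["depth"],
    dropout=model_cfg["dropout"],
)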

Requirements

  • Python 3.9+
  • PyTorch (CUDA optional but recommended for speed)
  • ffmpeg (system package) for robust multi-format audio I/O

Hardware guidance:

  • GPU with ≥4GB VRAM recommended for training (CPU works for inference)
  • For Arch users, see req-aur.txt for curated package names

Troubleshooting

  • Model not found: ensure models/best_gpu_model.pth exists or pass --model /path/to/checkpoint.pth.
  • Import/path errors: set UNET_AUDIOFILTER_ROOT to your repo root and retry quick_setup() (see the snippet after this list).
  • ffmpeg errors: install ffmpeg via your OS package manager and restart the session.
  • CUDA not used: pass --device cpu to force CPU or ensure CUDA drivers/toolkit are installed for GPU.
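For the import/path case, setting the override before importing the path helpers is enough:

# Point the path system at the repo root explicitly, then re-run setup.
import os
os.environ["UNET_AUDIOFILTER_ROOT"] = "/path/to/unet-audiofilter"  # adjust to your clone

from config.paths import quick_setup
quick_setup()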

Contributing

Pull requests are welcome. Please run ./run_tests.sh before submitting.

License

MIT License © 2025 Radhey Kalra
