Remove background music and noise from audiobooks using MDX-Net, VR, and Roformer neural network models.
A Python wrapper around audio-separator. It uses the same models as UVR5, but without UVR5's problems with large files and 32-bit architecture.
- Separate narrator's voice from background music
- Remove noise (hiss, hum, interference)
- Remove echo and reverberation
- Automatic splitting of large files
- Batch processing of directories
- Works on CPU and GPU (NVIDIA CUDA)
- Supported formats: MP3, WAV, FLAC, M4A, OGG, OPUS
pip install "audio-separator[cpu]"
# Windows (winget)
winget install ffmpeg
# Windows (chocolatey)
choco install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
For a 2-3x speed boost on NVIDIA GPUs:
# 1. PyTorch with CUDA
pip uninstall torch torchaudio torchvision -y
pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu121
# 2. ONNX Runtime with CUDA
pip install onnxruntime-gpu==1.18.1
# 3. cuDNN (required for ONNX Runtime)
pip install nvidia-cudnn-cu12==8.9.7.29
# PyTorch sees CUDA
python -c "import torch; print('CUDA:', torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else '')"
# ONNX Runtime sees CUDA
python -c "import onnxruntime as ort; print('Providers:', ort.get_available_providers())"
Expected output:
CUDA: True NVIDIA GeForce ...
Providers: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
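The provider check can also be scripted. The sketch below is not part of audiobook_cleaner.py: `gpu_status` is a hypothetical helper that only classifies a provider list like the one printed by the command above. It assumes that the presence of `CUDAExecutionProvider` is what matters for GPU acceleration.

```python
def gpu_status(providers):
    """Classify an ONNX Runtime provider list (hypothetical helper)."""
    if "CUDAExecutionProvider" in providers:
        return "gpu"
    return "cpu"


if __name__ == "__main__":
    try:
        import onnxruntime as ort

        providers = ort.get_available_providers()
    except ImportError:
        # onnxruntime is not installed, so nothing can be accelerated.
        providers = []
    print("ONNX Runtime mode:", gpu_status(providers), providers)
```

If this reports `cpu` on a machine with an NVIDIA GPU, the ONNX Runtime or cuDNN install steps are the first thing to recheck.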
# Process single file
python audiobook_cleaner.py audiobook.mp3
# Process directory
python audiobook_cleaner.py --dir /path/to/audiobooks
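To preview which files a directory run might pick up, a minimal sketch is shown below. It is an illustration, not the script's actual code: `find_audio_files` is a hypothetical helper, the extension set is taken from the supported formats listed in this README, and the assumption that only the top level of the directory is scanned (no recursion) may not match the real behavior.

```python
from pathlib import Path

# Extensions taken from the supported-formats list in this README.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus"}


def find_audio_files(directory):
    """Return supported audio files directly inside `directory`, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_audio_files("/path/to/audiobooks")` returns a list of `Path` objects, with extensions matched case-insensitively.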
For files >100 MB, use chunking:
# Split into 5-minute chunks (recommended)
python audiobook_cleaner.py audiobook.mp3 --chunk-duration 300 --no-denoise
# Split into 10-minute chunks (if you have plenty of RAM)
python audiobook_cleaner.py audiobook.mp3 --chunk-duration 600 --no-denoise
The script will automatically split the file, process chunks, and merge the result.
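The splitting step can be pictured as computing start and end offsets for each chunk. This is a sketch of the idea only, assuming fixed-length chunks with a shorter final chunk. `chunk_spans` is a hypothetical name, and the script's real splitting logic may differ.

```python
def chunk_spans(total_seconds, chunk_seconds):
    """Return (start, end) offsets in seconds that cover the whole file."""
    if chunk_seconds <= 0:
        # Mirrors the documented default: 0 means chunking is off.
        return [(0, total_seconds)]
    spans = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans
```

For example, `chunk_spans(700, 300)` yields three spans, the last one only 100 seconds long.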
# List available models
python audiobook_cleaner.py --list-models
# Use specific model
python audiobook_cleaner.py audiobook.mp3 --model vocals_kim
Arguments:
input Input audio file
--dir, -d Process all files in directory
--output, -o Output directory
--model, -m Separation model (default: vocals)
--no-denoise Skip additional denoising
--cpu Force CPU usage
--format, -f Output format: mp3, wav, flac (default: mp3)
--sample-rate, -sr Sample rate (default: 44100)
--segment-size, -s Segment size in seconds (default: 64)
--chunk-duration, -c Split large files by N seconds (default: 0 = off)
--list-models, -l Show available models
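For readers who want to script around the CLI, the argument list above could be expressed with `argparse` roughly as follows. This is a reconstruction from the documented flags and defaults, not the script's source: `build_parser` is a hypothetical name, and the integer types and the optional positional `input` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented arguments (a sketch)."""
    parser = argparse.ArgumentParser(prog="audiobook_cleaner.py")
    # Optional here because --dir can be used without an input file.
    parser.add_argument("input", nargs="?", help="Input audio file")
    parser.add_argument("--dir", "-d", help="Process all files in directory")
    parser.add_argument("--output", "-o", help="Output directory")
    parser.add_argument("--model", "-m", default="vocals", help="Separation model")
    parser.add_argument("--no-denoise", action="store_true", help="Skip additional denoising")
    parser.add_argument("--cpu", action="store_true", help="Force CPU usage")
    parser.add_argument("--format", "-f", choices=["mp3", "wav", "flac"], default="mp3")
    parser.add_argument("--sample-rate", "-sr", type=int, default=44100)
    parser.add_argument("--segment-size", "-s", type=int, default=64)
    parser.add_argument("--chunk-duration", "-c", type=int, default=0)
    parser.add_argument("--list-models", "-l", action="store_true")
    return parser
```

Parsing `["book.mp3", "--chunk-duration", "300", "--no-denoise"]` with this parser gives `chunk_duration=300`, `no_denoise=True`, and the documented defaults for everything else.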
| Shortcut | Model | Description |
|---|---|---|
| vocals | UVR-MDX-NET-Voc_FT.onnx | Default, good balance |
| vocals_kim | Kim_Vocal_2.onnx | Alternative, sometimes better |
| roformer | BS-Roformer | Best quality, but slower and needs more VRAM |
| Shortcut | Model | Description |
|---|---|---|
| denoise | UVR-DeNoise.pth | Standard denoising |
| denoise_lite | UVR-DeNoise-Lite.pth | Lightweight version |
| roformer_denoise | Mel-Roformer-Denoise | Best quality |
| Shortcut | Model | Description |
|---|---|---|
| dereverb | UVR-DeEcho-DeReverb.pth | Echo + reverb |
| dereverb_normal | UVR-De-Echo-Normal.pth | Echo only |
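The shortcut tables amount to a simple lookup. The sketch below copies the names from the tables as written. How the script actually stores them is not documented here, and `resolve_model` with its pass-through for unknown names is a hypothetical illustration.

```python
# Shortcut -> model name, copied from the tables in this README.
MODEL_SHORTCUTS = {
    # Voice separation
    "vocals": "UVR-MDX-NET-Voc_FT.onnx",
    "vocals_kim": "Kim_Vocal_2.onnx",
    "roformer": "BS-Roformer",
    # Denoising
    "denoise": "UVR-DeNoise.pth",
    "denoise_lite": "UVR-DeNoise-Lite.pth",
    "roformer_denoise": "Mel-Roformer-Denoise",
    # Echo and reverb removal
    "dereverb": "UVR-DeEcho-DeReverb.pth",
    "dereverb_normal": "UVR-De-Echo-Normal.pth",
}


def resolve_model(name):
    """Map a shortcut to its model name; unknown names pass through (assumption)."""
    return MODEL_SHORTCUTS.get(name, name)
```

The authoritative list is whatever `python audiobook_cleaner.py --list-models` prints.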
# Quick processing without denoising
python audiobook_cleaner.py book.mp3 --no-denoise
# Large file — split into 5-minute chunks
python audiobook_cleaner.py book.mp3 --chunk-duration 300 --no-denoise
# Maximum quality (slow)
python audiobook_cleaner.py book.mp3 --model roformer --chunk-duration 300
# Processing on low-end machine
python audiobook_cleaner.py book.mp3 --chunk-duration 120 --segment-size 16 --cpu --no-denoise
# Batch processing to WAV
python audiobook_cleaner.py --dir ./books --output ./clean --format wav
# Remove noise from a cockpit voice recorder (CVR) recording
python audiobook_cleaner.py cvr.mp3 --model denoise --no-denoise
Approximate processing time for a 5-minute chunk (300 sec):
| Configuration | Time | Notes |
|---|---|---|
| CPU (i7-8700) | 2:30 | No GPU |
| GPU GTX 1660 Super (PyTorch) | 1:40 | ONNX → PyTorch conversion |
| GPU GTX 1660 Super (ONNX) | 0:30-0:40 | Full acceleration |
For a 7-hour audiobook (85 chunks of 5 min):
- CPU: ~3.5 hours
- GPU (PyTorch): ~2.5 hours
- GPU (ONNX): ~45-60 minutes
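These totals follow from multiplying the per-chunk times in the table by the number of chunks. A small worked calculation, using the 85 chunks and per-chunk times quoted above (`total_hours` is just an illustrative helper):

```python
def total_hours(chunks, seconds_per_chunk):
    """Total processing time in hours for equally sized chunks."""
    return chunks * seconds_per_chunk / 3600


# Per-chunk times from the table: 2:30, 1:40, and 0:30-0:40.
timings = [
    ("CPU", 150),
    ("GPU (PyTorch)", 100),
    ("GPU (ONNX), best case", 30),
    ("GPU (ONNX), worst case", 40),
]

for label, seconds in timings:
    print(f"{label}: {total_hours(85, seconds):.2f} h")
# CPU: 3.54 h
# GPU (PyTorch): 2.36 h
# GPU (ONNX), best case: 0.71 h
# GPU (ONNX), worst case: 0.94 h
```

The ONNX range of 0.71 to 0.94 hours is roughly 43 to 57 minutes, in line with the rounded estimate above.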
- Step 1: MDX-Net/Roformer model separates audio into "vocals" and "instrumental"
- Step 2 (optional): DeNoise model removes residual noise
Output files:
- `*_(Vocals).mp3`: clean narrator's voice
- `*_(Instrumental).mp3`: separated music/noise
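The naming pattern makes it easy to sort results in a post-processing script. A minimal sketch, assuming output names end exactly as shown above (`split_outputs` is a hypothetical helper, not part of the tool):

```python
from fnmatch import fnmatch


def split_outputs(filenames, extension="mp3"):
    """Separate output names into (vocals, instrumental) lists by suffix pattern."""
    vocals = [n for n in filenames if fnmatch(n, f"*_(Vocals).{extension}")]
    instrumental = [n for n in filenames if fnmatch(n, f"*_(Instrumental).{extension}")]
    return vocals, instrumental
```

Pass `extension="wav"` or `extension="flac"` when a different `--format` was used.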
- Python 3.9+
- FFmpeg
- 8+ GB RAM (16 GB recommended for large files)
- NVIDIA GPU with 4+ GB VRAM (optional, for acceleration)
# Use chunking + small segment size
python audiobook_cleaner.py book.mp3 --chunk-duration 300 --segment-size 16 --no-denoise
Models are downloaded automatically. Check your internet connection.
- Verify installation:
python -c "import torch; print(torch.cuda.is_available())"
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
- If CUDAExecutionProvider is missing:
pip uninstall onnxruntime onnxruntime-gpu -y
pip install onnxruntime-gpu==1.18.1
pip install nvidia-cudnn-cu12==8.9.7.29
cuDNN is missing. Install:
pip install nvidia-cudnn-cu12==8.9.7.29
ONNX Runtime can't find CUDA DLLs. Options:
- Install cuDNN via pip (see above)
- Or install CUDA Toolkit system-wide
If you see this in the logs:
Model converted from onnx to pytorch due to segment size not matching dim_t
The model is being converted to PyTorch, which is slower. This is normal for some configurations.
MIT
- python-audio-separator
- Ultimate Vocal Remover
- Authors of MDX-Net, Demucs, and Roformer models