A powerful, production-ready audio transcription tool built on WhisperX that provides state-of-the-art speech-to-text with speaker diarization (speaker identification). Perfect for transcribing meetings, interviews, podcasts, lectures, and any multi-speaker audio content.
- High-accuracy transcription using OpenAI's Whisper models
- Speaker diarization to identify who is speaking when
- Multiple output formats: TXT, JSON, SRT, VTT
- Batch processing for multiple files
- GPU acceleration with automatic CPU fallback
- Word-level timestamps for precise alignment
- Multi-language support with auto-detection
- Progress tracking with timestamps and progress bars
- Robust error handling and validation
```bash
# Single command to transcribe with speaker identification
python main.py --audio meeting.mp3

# Output:
[2025-09-29 12:44:39] Processing: meeting.mp3
[2025-09-29 12:44:39] Transcription complete!
[2025-09-29 12:44:39] Duration: 47.59s
[2025-09-29 12:44:39] Language: en
[2025-09-29 12:44:39] Speakers: 2
[2025-09-29 12:44:39] Segments: 20
```

Result:

```
[00:00:00.622 --> 00:00:01.563] SPEAKER_00: Hello, how's it going?
[00:00:02.524 --> 00:00:03.406] SPEAKER_01: Good, how are you?
[00:00:04.267 --> 00:00:06.270] SPEAKER_00: Do you want to talk about your insurance policy?
[00:00:06.530 --> 00:00:07.613] SPEAKER_01: Yes, I would like that.
```
- MP3 (.mp3) - Most common format
- WAV (.wav) - Uncompressed audio
- M4A (.m4a) - Apple/iTunes format
- FLAC (.flac) - Lossless compression
- OGG (.ogg) - Open source format
- MP4 (.mp4) - Video files (audio extracted)
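For MP4 and other video containers, the audio track is extracted automatically. If you prefer to pre-process the audio yourself (for example, to downmix to mono before a large batch job), a standard FFmpeg invocation such as the following works; the file names here are placeholders:

```bash
# Optional pre-processing: extract a mono 16 kHz WAV track from a video file
ffmpeg -i meeting.mp4 -vn -ar 16000 -ac 1 meeting.wav
```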
macOS:

```bash
# Install FFmpeg
brew install ffmpeg

# Clone repository
git clone https://github.com/yourusername/whisperx-audio-transcriber.git
cd whisperx-audio-transcriber

# Install Python dependencies
pip install -e .
```

Ubuntu/Debian:

```bash
# Install FFmpeg
sudo apt update && sudo apt install ffmpeg

# Clone and install
git clone https://github.com/yourusername/whisperx-audio-transcriber.git
cd whisperx-audio-transcriber
pip install -e .
```

Speaker diarization requires accepting the Pyannote model terms:
- Create a Hugging Face account: huggingface.co
- Generate an access token:
  - Go to Settings > Access Tokens
  - Click "New Token" → choose "Read" access → copy the token
- ⚠️ IMPORTANT: Accept the model terms (must be done while logged in):
  - Visit pyannote/speaker-diarization-3.1 → click "Agree and access repository"
  - Visit pyannote/segmentation-3.0 → click "Agree and access repository"
- Configure the token:

```bash
# Option 1: Environment variable
export HF_TOKEN=hf_your_token_here

# Option 2: .env file (recommended)
echo "HF_TOKEN=hf_your_token_here" > .env
```

Verification: On first run, you should see "Diarization model loaded successfully".
```bash
# Single file
python main.py --audio your_audio.mp3

# Batch processing
python main.py --batch /path/to/audio/folder/

# Advanced options
python main.py --audio meeting.wav --model large-v2 --min_speakers 2 --max_speakers 5
```

| Model | Size | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|---|
| tiny | 39MB | 🚀🚀🚀 | ⭐⭐⭐ | 1GB | Quick drafts |
| base | 74MB | 🚀🚀 | ⭐⭐⭐⭐ | 1GB | Recommended |
| small | 244MB | 🚀 | ⭐⭐⭐⭐ | 2GB | Better accuracy |
| medium | 769MB | 🐢 | ⭐⭐⭐⭐⭐ | 5GB | High accuracy |
| large-v3 | 1.5GB | 🐢🐢 | ⭐⭐⭐⭐⭐ | 10GB | Maximum quality |
Speed relative to real-time on GPU
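If you want to pick a model size programmatically, a purely illustrative sketch follows, with thresholds taken from the memory column above. It assumes PyTorch is installed; `suggest_model()` is a hypothetical helper, not part of this project's API:

```python
# Illustrative sketch: suggest a Whisper model size from available GPU memory.
# Thresholds follow the memory column above; suggest_model() is hypothetical.
import torch

def suggest_model() -> str:
    if not torch.cuda.is_available():
        return "base"  # CPU fallback: favor speed over accuracy
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"

print(suggest_model())
```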
```
python main.py [OPTIONS]

Required (choose one):
  --audio FILE             Single audio file
  --batch DIRECTORY        Process all audio files in directory

Model Options:
  --model SIZE             tiny, base, small, medium, large-v2, large-v3 (default: base)
  --language CODE          Language code (en, es, fr, etc.) or auto-detect
  --device DEVICE          cpu, cuda, or auto-detect (default: auto)

Output Options:
  --output_format FORMAT   txt, json, srt, vtt, all (default: all)
  --output_dir DIR         Output directory (default: output)

Speaker Options:
  --min_speakers N         Minimum speakers for diarization
  --max_speakers N         Maximum speakers for diarization

Other:
  --verbose                Detailed logging and error traces
```

Customer Service Call:

```bash
python main.py --audio customer_call.mp3 --min_speakers 2 --max_speakers 2 --output_format json
```

Podcast Episode:

```bash
python main.py --audio podcast_ep1.mp3 --model medium --language en --output_format srt
```

Team Meeting:

```bash
python main.py --audio standup.wav --min_speakers 4 --max_speakers 8 --verbose
```

Lecture Series (Batch):

```bash
python main.py --batch ./lectures/ --model small --output_dir ./transcripts/
```

TXT:

```
[00:00:00.000 --> 00:00:05.240] SPEAKER_00: Welcome to today's meeting.
[00:00:05.580 --> 00:00:08.920] SPEAKER_01: Thanks for having me here.
```

JSON:

```json
{
"metadata": {
"audio_file": "meeting.mp3",
"duration": 1847.2,
"language": "en",
"speakers_detected": 3,
"model": "base"
},
"segments": [
{
"start": 0.0,
"end": 5.24,
"speaker": "SPEAKER_00",
"text": "Welcome to today's meeting."
}
]
}
```
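The JSON output is the easiest format to post-process. As a minimal sketch, assuming the segment schema shown above and the default output directory (the path `output/meeting.json` is an assumption), this computes per-speaker talk time:

```python
# Minimal sketch: compute per-speaker talk time from the JSON output.
# The path below assumes the default output directory; adjust to your run.
import json
from collections import defaultdict

with open("output/meeting.json") as f:
    transcript = json.load(f)

talk_time = defaultdict(float)
for seg in transcript["segments"]:
    talk_time[seg["speaker"]] += seg["end"] - seg["start"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f}s")
```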
SRT:

```
1
00:00:00,000 --> 00:00:05,240
SPEAKER_00: Welcome to today's meeting.
2
00:00:05,580 --> 00:00:08,920
SPEAKER_01: Thanks for having me here.WEBVTT
00:00:00.000 --> 00:00:05.240
SPEAKER_00: Welcome to today's meeting.
00:00:05.580 --> 00:00:08.920
SPEAKER_01: Thanks for having me here.
```

- GPU: 5-10x faster than CPU
- Memory: 8GB+ RAM recommended for large models
- Storage: SSD preferred for large batch jobs
```bash
# Fast processing (good quality)
python main.py --audio file.mp3 --model base --device cuda

# Maximum quality (slower)
python main.py --audio file.mp3 --model large-v3 --language en

# Batch optimization
python main.py --batch ./files/ --model small --output_format txt
```

| Problem | Solution |
|---|---|
| Permission denied with uv | `export TMPDIR=/tmp` |
| FFmpeg not found | Install FFmpeg: `brew install ffmpeg` |
| Out of memory | Use `--model small` or `--device cpu` |
| Poor speaker separation | Add `--min_speakers 2 --max_speakers 4` |
| Slow processing | Use a smaller model or enable GPU |
| HF_TOKEN missing | Set up a Hugging Face token (see setup) |
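If runs are unexpectedly slow, the usual cause is that the GPU is not actually visible to PyTorch. A quick check, assuming PyTorch is installed:

```python
# Quick check: is CUDA available to PyTorch?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```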
- Download time: 2-5 minutes (depending on internet)
- Model size: ~2GB total for all components
- Cache location: `~/.cache/whisperx/`
- Subsequent runs: Much faster (models cached)
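To see how much disk the cached models occupy (using the cache path listed above):

```bash
du -sh ~/.cache/whisperx/
```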
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ --cov=.

# Code formatting
black . && isort . && flake8 .
```

See preload_models.py and serverless_transcriber.py for optimized deployment patterns.
We welcome contributions! Please see our contributing guidelines:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes with tests
- Ensure tests pass (`pytest tests/`)
- Update documentation if needed
- Submit a pull request
- Additional language support
- Web interface
- GUI application
- Better visualization
- Docker containerization
- Cloud deployment guides
This project is licensed under the MIT License - see the LICENSE file for details.
- Code: MIT License (free for commercial use)
- Models: pyannote.audio models require separate licensing for commercial use
- Research use: Completely free
- Commercial use: Contact pyannote team for model licensing
This project builds upon incredible work from:
- WhisperX - Core transcription engine
- OpenAI Whisper - Speech recognition models
- pyannote.audio - Speaker diarization
- PyTorch - Machine learning framework
Made with ❤️ by [Your Name]

⭐ Star this repo • Report Bug • Request Feature