
πŸŽ™οΈ WhisperX Audio Transcriber

Python 3.9+ · License: MIT · Code style: black

A powerful, production-ready audio transcription tool built on WhisperX that provides state-of-the-art speech-to-text with speaker diarization (speaker identification). Perfect for transcribing meetings, interviews, podcasts, lectures, and any multi-speaker audio content.

✨ Features

  • πŸš€ High-accuracy transcription using OpenAI's Whisper models
  • πŸ‘₯ Speaker diarization to identify who is speaking when
  • πŸ“„ Multiple output formats: TXT, JSON, SRT, VTT
  • ⚑ Batch processing for multiple files
  • πŸ–₯️ GPU acceleration with automatic CPU fallback
  • ⏱️ Word-level timestamps for precise alignment
  • 🌍 Multi-language support with auto-detection
  • πŸ“Š Progress tracking with timestamps and progress bars
  • πŸ›‘οΈ Robust error handling and validation

🎯 Demo

# Single command to transcribe with speaker identification
python main.py --audio meeting.mp3

# Output:
[2025-09-29 12:44:39] Processing: meeting.mp3
[2025-09-29 12:44:39] βœ… Transcription complete!
[2025-09-29 12:44:39]    Duration: 47.59s
[2025-09-29 12:44:39]    Language: en
[2025-09-29 12:44:39]    Speakers: 2
[2025-09-29 12:44:39]    Segments: 20

Result:

[00:00:00.622 --> 00:00:01.563] SPEAKER_00: Hello, how's it going?
[00:00:02.524 --> 00:00:03.406] SPEAKER_01: Good, how are you?
[00:00:04.267 --> 00:00:06.270] SPEAKER_00: Do you want to talk about your insurance policy?
[00:00:06.530 --> 00:00:07.613] SPEAKER_01: Yes, I would like that.
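
Under the hood, the tool follows the standard WhisperX pipeline: transcribe, align for word-level timestamps, then diarize and assign speaker labels. A minimal sketch of that flow using the whisperx Python API (illustrative only, not a copy of main.py; exact signatures vary between whisperx versions):

import whisperx

device = "cuda"  # use "cpu" if no GPU is available

# 1. Transcribe with a Whisper model
audio = whisperx.load_audio("meeting.mp3")
model = whisperx.load_model("base", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align the transcript to get word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels to each segment
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_your_token_here", device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=2)
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"])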

🎡 Supported Audio Formats

  • MP3 (.mp3) - Most common format
  • WAV (.wav) - Uncompressed audio
  • M4A (.m4a) - Apple/iTunes format
  • FLAC (.flac) - Lossless compression
  • OGG (.ogg) - Open source format
  • MP4 (.mp4) - Video files (audio extracted)

πŸš€ Quick Start

1. Install Dependencies

macOS:

# Install FFmpeg
brew install ffmpeg

# Clone repository
git clone https://github.com/switchbm/whisperx-audio-transcriber.git
cd whisperx-audio-transcriber

# Install Python dependencies
pip install -e .

Ubuntu/Debian:

# Install FFmpeg
sudo apt update && sudo apt install ffmpeg

# Clone and install
git clone https://github.com/switchbm/whisperx-audio-transcriber.git
cd whisperx-audio-transcriber
pip install -e .

2. Set Up Hugging Face Token (for speaker diarization)

πŸ”‘ Speaker diarization requires accepting Pyannote model terms

  1. Create a Hugging Face account: huggingface.co
  2. Generate an access token: huggingface.co/settings/tokens
  3. ⚠️ IMPORTANT: Accept the user conditions for the pyannote speaker-diarization and segmentation models on Hugging Face (must be done while logged in)
  4. Configure the token:
# Option 1: Environment variable
export HF_TOKEN=hf_your_token_here

# Option 2: .env file (recommended)
echo "TOKEN=hf_your_token_here" > .env

βœ… Verification: On first run, you should see "Diarization model loaded successfully"
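
At runtime the token resolution is roughly the following (a sketch assuming python-dotenv; the project's actual lookup may differ):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # merge values from a local .env file, if one exists
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    print("Warning: HF_TOKEN is not set; the gated pyannote diarization models cannot be downloaded.")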

3. Run Your First Transcription

# Single file
python main.py --audio your_audio.mp3

# Batch processing
python main.py --batch /path/to/audio/folder/

# Advanced options
python main.py --audio meeting.wav --model large-v2 --min_speakers 2 --max_speakers 5

πŸ“Š Model Performance Comparison

| Model | Size | Speed | Accuracy | Memory | Best For |
|----------|-------|---------|------------|--------|-----------------|
| tiny | 39MB | πŸš€πŸš€πŸš€ | ⭐⭐⭐ | 1GB | Quick drafts |
| base | 74MB | πŸš€πŸš€ | ⭐⭐⭐⭐ | 1GB | Recommended |
| small | 244MB | πŸš€ | ⭐⭐⭐⭐ | 2GB | Better accuracy |
| medium | 769MB | 🐌 | ⭐⭐⭐⭐⭐ | 5GB | High accuracy |
| large-v3 | 1.5GB | 🐌🐌 | ⭐⭐⭐⭐⭐ | 10GB | Maximum quality |

Speed relative to real-time on GPU

πŸ”§ Advanced Usage

Command Line Options

python main.py [OPTIONS]

Required (choose one):
  --audio FILE          Single audio file
  --batch DIRECTORY     Process all audio files in directory

Model Options:
  --model SIZE          tiny, base, small, medium, large-v2, large-v3 (default: base)
  --language CODE       Language code (en, es, fr, etc.) or auto-detect
  --device DEVICE       cpu, cuda, or auto-detect (default: auto)

Output Options:
  --output_format FORMAT  txt, json, srt, vtt, all (default: all)
  --output_dir DIR        Output directory (default: output)

Speaker Options:
  --min_speakers N      Minimum speakers for diarization
  --max_speakers N      Maximum speakers for diarization

Other:
  --verbose            Detailed logging and error traces
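
For orientation, the options above map onto a fairly standard argparse layout. A hypothetical sketch (not the project's actual main.py):

import argparse

parser = argparse.ArgumentParser(description="WhisperX audio transcriber with speaker diarization")
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("--audio", help="single audio file")
source.add_argument("--batch", help="directory of audio files")
parser.add_argument("--model", default="base",
                    choices=["tiny", "base", "small", "medium", "large-v2", "large-v3"])
parser.add_argument("--language", default=None, help="language code, or omit to auto-detect")
parser.add_argument("--device", default="auto", choices=["auto", "cpu", "cuda"])
parser.add_argument("--output_format", default="all", choices=["txt", "json", "srt", "vtt", "all"])
parser.add_argument("--output_dir", default="output")
parser.add_argument("--min_speakers", type=int)
parser.add_argument("--max_speakers", type=int)
parser.add_argument("--verbose", action="store_true")
args = parser.parse_args()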

Real-World Examples

πŸ“ž Customer Service Call:

python main.py --audio customer_call.mp3 --min_speakers 2 --max_speakers 2 --output_format json

🎀 Podcast Episode:

python main.py --audio podcast_ep1.mp3 --model medium --language en --output_format srt

πŸ‘₯ Team Meeting:

python main.py --audio standup.wav --min_speakers 4 --max_speakers 8 --verbose

πŸ“š Lecture Series (Batch):

python main.py --batch ./lectures/ --model small --output_dir ./transcripts/
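
Batch mode simply walks the given directory for the supported extensions listed earlier. A rough sketch of that selection logic (illustrative, not the project's exact code):

from pathlib import Path

SUPPORTED = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".mp4"}

def find_audio_files(folder: str) -> list[Path]:
    """Return supported audio/video files in a directory, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in SUPPORTED)

for path in find_audio_files("./lectures/"):
    print("Would transcribe:", path)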

πŸ“ Output Formats

πŸ“ TXT - Human Readable

[00:00:00.000 --> 00:00:05.240] SPEAKER_00: Welcome to today's meeting.
[00:00:05.580 --> 00:00:08.920] SPEAKER_01: Thanks for having me here.

πŸ”— JSON - Structured Data

{
  "metadata": {
    "audio_file": "meeting.mp3",
    "duration": 1847.2,
    "language": "en",
    "speakers_detected": 3,
    "model": "base"
  },
  "segments": [
    {
      "start": 0.0,
      "end": 5.24,
      "speaker": "SPEAKER_00",
      "text": "Welcome to today's meeting."
    }
  ]
}
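
The JSON output is the easiest format to post-process. For example, a short script that tallies speaking time per speaker from a saved transcript (the file path is just a placeholder):

import json
from collections import defaultdict

with open("output/meeting.json") as f:
    transcript = json.load(f)

talk_time = defaultdict(float)
for segment in transcript["segments"]:
    talk_time[segment.get("speaker", "UNKNOWN")] += segment["end"] - segment["start"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f}s")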

🎬 SRT - Subtitles

1
00:00:00,000 --> 00:00:05,240
SPEAKER_00: Welcome to today's meeting.

2
00:00:05,580 --> 00:00:08,920
SPEAKER_01: Thanks for having me here.

🌐 VTT - Web Subtitles

WEBVTT

00:00:00.000 --> 00:00:05.240
SPEAKER_00: Welcome to today's meeting.

00:00:05.580 --> 00:00:08.920
SPEAKER_01: Thanks for having me here.
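
The only real difference between the SRT and VTT timestamps above is the millisecond separator (comma vs. dot). A small helper that produces both from a segment's start/end seconds:

def format_timestamp(seconds: float, srt: bool = False) -> str:
    """Format seconds as HH:MM:SS.mmm (VTT) or HH:MM:SS,mmm (SRT)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    separator = "," if srt else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"

print(format_timestamp(5.24))            # 00:00:05.240
print(format_timestamp(5.24, srt=True))  # 00:00:05,240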

⚑ Performance Tips

πŸ–₯️ Hardware Optimization

  • GPU: 5-10x faster than CPU
  • Memory: 8GB+ RAM recommended for large models
  • Storage: SSD preferred for large batch jobs

πŸŽ›οΈ Settings Optimization

# Fast processing (good quality)
python main.py --audio file.mp3 --model base --device cuda

# Maximum quality (slower)
python main.py --audio file.mp3 --model large-v3 --language en

# Batch optimization
python main.py --batch ./files/ --model small --output_format txt

πŸ› οΈ Troubleshooting

Common Issues & Solutions

| Problem | Solution |
|------------------------------|------------------------------------------|
| πŸ”’ Permission denied with uv | export TMPDIR=/tmp |
| 🚫 FFmpeg not found | Install FFmpeg: brew install ffmpeg |
| ⚠️ CUDA out of memory | Use --model small or --device cpu |
| πŸ‘₯ Poor speaker separation | Add --min_speakers 2 --max_speakers 4 |
| 🐌 Slow processing | Use a smaller model or enable GPU |
| πŸ”‘ HF_TOKEN missing | Set up Hugging Face token (see setup) |
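
Most of the GPU-related issues above come down to device selection. The auto-detect behaviour is essentially a torch availability check; a minimal sketch:

import torch

def pick_device(requested: str = "auto") -> str:
    """Resolve --device: honour an explicit choice, otherwise prefer CUDA when available."""
    if requested != "auto":
        return requested
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_device())  # "cuda" on a GPU machine, "cpu" otherwise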

πŸ“Š First Run Information

  • Download time: 2-5 minutes (depending on connection speed)
  • Model size: ~2GB total for all components
  • Cache location: ~/.cache/whisperx/
  • Subsequent runs: Much faster (models cached)

πŸ—οΈ Development

πŸ§ͺ Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ --cov=.

# Code formatting
black . && isort . && flake8 .

πŸš€ Serverless Deployment

See preload_models.py and serverless_transcriber.py for optimized deployment patterns.
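
The idea behind preloading is to download and cache all model weights at build/deploy time so cold starts skip the multi-minute download. A minimal sketch of that pattern, assuming the whisperx API (the real preload_models.py may do more):

import os
import whisperx

# Pull the Whisper, alignment, and diarization weights into the local cache.
whisperx.load_model("base", device="cpu", compute_type="int8")
whisperx.load_align_model(language_code="en", device="cpu")
whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device="cpu")
print("Models cached; cold starts will reuse them.")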

🀝 Contributing

We welcome contributions! Please see our contributing guidelines:

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch (git checkout -b feature/amazing-feature)
  3. ✍️ Make your changes with tests
  4. βœ… Ensure tests pass (pytest tests/)
  5. πŸ“ Update documentation if needed
  6. πŸš€ Submit a pull request

πŸ’‘ Ideas for Contributions

  • 🌍 Additional language support
  • πŸ“± Web interface
  • 🎨 GUI application
  • πŸ“Š Better visualization
  • 🐳 Docker containerization
  • ☁️ Cloud deployment guides

πŸ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.

βš–οΈ Important Notes

  • Code: MIT License (free for commercial use)
  • Models: pyannote.audio models require separate licensing for commercial use
  • Research use: Completely free
  • Commercial use: Contact pyannote team for model licensing

πŸ™ Acknowledgments

This project builds upon incredible work from:



Made with ❀️ by [Your Name]

⭐ Star this repo β€’ πŸ› Report Bug β€’ πŸ’‘ Request Feature
