A Python application for real-time audio transcription and speaker diarization using Faster-Whisper and PyAnnote.
- Real-time audio transcription with Apple Silicon support (MPS)
- Advanced speaker diarization using PyAnnote
- Support for microphone and audio file input
- Multiple Whisper model sizes and quantization options
- Configurable via JSON with text or JSON output formats
- Thread-safe parallel processing of transcription and diarization
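The last point means transcription and diarization run in separate worker threads fed from a shared buffer. As a rough illustration of that producer/consumer pattern (not the actual `whisperize.py` internals; `transcribe_chunk` and `diarize_chunk` are hypothetical stand-ins):

```python
import queue
import threading
import time

audio_chunks = queue.Queue()  # thread-safe hand-off between capture and processing

def transcribe_chunk(chunk):
    """Hypothetical stand-in for the Faster-Whisper transcription step."""
    time.sleep(0.1)

def diarize_chunk(chunk):
    """Hypothetical stand-in for the PyAnnote diarization step."""
    time.sleep(0.1)

def process_loop():
    """Consume buffered audio and run both stages concurrently per chunk."""
    while True:
        chunk = audio_chunks.get()
        if chunk is None:  # sentinel: stop processing
            break
        t = threading.Thread(target=transcribe_chunk, args=(chunk,))
        d = threading.Thread(target=diarize_chunk, args=(chunk,))
        t.start(); d.start()
        t.join(); d.join()
```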
- Python 3.10
- FFmpeg (required for audio processing)
```bash
# On macOS using Homebrew
brew install ffmpeg

# On Ubuntu/Debian
sudo apt-get install ffmpeg

# On Windows using Chocolatey
choco install ffmpeg
```
- Apple Silicon Mac recommended for optimal performance with MPS acceleration
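To confirm that MPS acceleration will actually be used, you can check the PyTorch backend (assuming PyTorch is pulled in via `requirements.txt`):

```python
import torch

# True on Apple Silicon Macs with a Metal-capable PyTorch build
if torch.backends.mps.is_available():
    print("MPS acceleration available")
else:
    print("No MPS backend; transcription will fall back to CPU")
```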
```bash
git clone https://github.com/francescopace/whisperize.git
cd whisperize

python -m venv .venv
source .venv/bin/activate   # On Unix/macOS
# or
.venv\Scripts\activate      # On Windows

pip install -r requirements.txt
```

- Create a HuggingFace account at https://huggingface.co/
- Generate an access token at https://huggingface.co/settings/tokens
- Edit `config.json` and update it with your settings:

```json
{
  "huggingface_token": "your_token_here",
  "output_folder": "transcripts/",
  "output_format": "text",
  "model": "turbo",
  "whisper_force_cpu": false,
  "language": "it",
  "buffer_duration": 4
}
```

- `huggingface_token` (required): Your HuggingFace API token for accessing PyAnnote models
- `output_folder` (required): Directory where transcripts will be saved
- `output_format` (optional): Output format, `"text"` or `"json"` (default: `"text"`)
  - `text`: Creates a human-readable transcript with timestamps
  - `json`: Creates both a text file and a structured JSON file with metadata and word-level timestamps
- `model` (optional): Whisper model size: `"tiny"`, `"base"`, `"small"`, `"medium"`, `"large"`, or `"turbo"` (default: `"base"`)
- `whisper_force_cpu` (optional): Force CPU usage even if GPU/MPS is available (default: `false`)
- `language` (optional): Language code (e.g., `"it"`, `"en"`, `"es"`). If not specified, the language is auto-detected
- `buffer_duration` (optional): Audio buffer duration in seconds (default: `5.0`)
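For reference, here is a minimal sketch of how these options could be read and defaulted in Python; the actual loader in `whisperize.py` may differ:

```python
import json

with open("config.json") as f:
    config = json.load(f)

token = config["huggingface_token"]                  # required
output_folder = config["output_folder"]              # required
output_format = config.get("output_format", "text")
model_size = config.get("model", "base")
force_cpu = config.get("whisper_force_cpu", False)
language = config.get("language")                    # None -> auto-detect
buffer_duration = config.get("buffer_duration", 5.0)
```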
The application uses Faster-Whisper for transcription. Available models:
- `tiny`: Fastest, lowest accuracy
- `base`: Good balance of speed and accuracy
- `small`: Better accuracy, slower
- `medium`: High accuracy
- `large`: Highest accuracy, slowest
- `turbo`: Optimized large model
See Whisper documentation for language support and model details.
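For context, standalone Faster-Whisper usage looks like this; the model name and `compute_type` here are illustrative choices, not the tool's exact call:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "path/to/audio.wav",
    language="it",          # omit to auto-detect
    word_timestamps=True,
)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```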
The application uses PyAnnote speaker-diarization-3.1 for speaker identification. This model is automatically loaded and requires a HuggingFace token for access.
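Standalone usage of that pipeline looks roughly like this (the token value is a placeholder; you must also accept the model's terms on huggingface.co):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_token_here",  # placeholder: use the token from config.json
)

diarization = pipeline("path/to/audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}")
```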
Microphone Input (default):

```bash
python whisperize.py
# or explicitly
python whisperize.py microphone
```

Audio File Input:

```bash
python whisperize.py path/to/audio.wav
```

Note: Only WAV files (16-bit, mono or stereo) are currently supported.
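If your audio is in another format, FFmpeg (installed above) can convert it. 16 kHz mono is a reasonable choice for Whisper, though any 16-bit WAV should work:

```bash
ffmpeg -i input.mp3 -ac 1 -ar 16000 -c:a pcm_s16le output.wav
```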
Transcripts are saved in the `output_folder` specified in `config.json`:
Text Format (`output_format: "text"`):

```
# Transcript started at 2025-02-11 18:30:00
[00:00:02.500-00:00:05.300] [SPEAKER_00]: Hello, this is a test transcription.
[00:00:06.100-00:00:09.800] [SPEAKER_01]: Yes, I can hear you clearly.
```
JSON Format (`output_format: "json"`):

- Creates both a `.txt` file (for real-time monitoring) and a `.json` file
- JSON includes metadata, speaker labels, timestamps, and word-level details with confidence scores
Example JSON structure:
```json
{
  "metadata": {
    "start_time": "2025-02-11T18:30:00",
    "duration": 120.5,
    "model": "turbo",
    "language": "it",
    "source": "microphone"
  },
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "start": 2.500,
      "end": 5.300,
      "text": "Hello, this is a test transcription.",
      "words": [
        {
          "word": "Hello",
          "start": 2.500,
          "end": 2.800,
          "probability": 0.95
        }
      ]
    }
  ]
}
```
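Because the JSON file follows the structure above, downstream scripts can consume it directly. For example (the output file name is hypothetical):

```python
import json

with open("transcripts/session.json") as f:  # hypothetical output file name
    transcript = json.load(f)

meta = transcript["metadata"]
print(f'{meta["model"]} / {meta["language"]} / {meta["duration"]}s')
for seg in transcript["segments"]:
    print(f'{seg["speaker"]}: {seg["text"]}')
```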