A full pipeline to detect and highlight active speakers in videos using YOLO for face detection and TalkNet for speaker detection.

    git clone https://github.com/MjdMahasneh/active-speaker-detection.git
    cd active-speaker-detection
    curl -LsSf https://astral.sh/uv/install.sh | sh
    uv sync

This will create a virtual environment and install all dependencies including PyTorch with CUDA support.

To run the pipeline, modify configurations in ./config/args.py and then run main.py. Alternatively, you can run the script directly with command line arguments.

    uv run python main.py --videoName video --videoFolder workdir

For better performance with GPU, increase batch size and data loader threads:

    uv run python main.py --videoName video --videoFolder workdir --yoloBatchSize 64 --nDataLoaderThread 16

For faster processing without video visualization (metadata only):

    uv run python main.py --videoName video --videoFolder workdir --metadataOnly

To use a larger YOLO model for better accuracy:

    uv run python main.py --videoName video --videoFolder workdir --yoloVariant m

Note: video can be in .mp4 or .avi formats.


    workdir/
    └── video/
        ├── pyavi/       # extracted audio + output video
        ├── pyframes/    # all video frames (JPEG format)
        ├── pycrop/      # cropped face clips
        └── pywork/
            ├── tracks.pckl             # face tracks
            ├── scores.pckl             # speaking scores
            ├── speaker_summary.json    # summary of speaker activity
            └── frame_metadata.json     # frame-centric metadata

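
The files under `pywork/` can be inspected from Python after a run. The sketch below assumes the `.pckl` files are ordinary pickle dumps and makes no assumption about what the objects inside look like. The helper name `load_pipeline_outputs` is hypothetical and not part of the project.

```python
import json
import pickle
from pathlib import Path


def load_pipeline_outputs(video_dir):
    """Load whichever pywork/ artifacts exist under a video's output folder."""
    pywork = Path(video_dir) / "pywork"
    outputs = {}
    # Assumption: the .pckl files are standard pickle dumps.
    # Only unpickle files that you produced yourself.
    for name in ("tracks", "scores"):
        path = pywork / f"{name}.pckl"
        if path.exists():
            with path.open("rb") as fh:
                outputs[name] = pickle.load(fh)
    # The JSON files are read as-is; their schema is not assumed here.
    for name in ("speaker_summary", "frame_metadata"):
        path = pywork / f"{name}.json"
        if path.exists():
            with path.open("r", encoding="utf-8") as fh:
                outputs[name] = json.load(fh)
    return outputs
```

For the layout above, `load_pipeline_outputs("workdir/video")` would return a dictionary with up to four keys; files that are missing (for example after an interrupted run) are skipped.
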
- YOLOv11-Face: Face detection (variants: n/s/m/l/x for speed vs accuracy tradeoff)
- TalkNet: Audio-visual active speaker detection

| Option | Default | Description |
|---|---|---|
| `--videoName` | `video` | Input video name (without extension) |
| `--videoFolder` | `workdir` | Path for inputs and outputs |
| `--yoloVariant` | `n` | YOLO variant: n (nano), s (small), m (medium), l (large), x (extra-large) |
| `--yoloBatchSize` | `32` | Batch size for face detection (increase for better GPU utilization) |
| `--nDataLoaderThread` | `10` | Number of parallel workers for video cropping |
| `--speakerThresh` | `0.6` | Speaker detection confidence threshold |
| `--minSpeechLen` | `0.25` | Minimum speech duration (seconds) to count as speaking |
| `--ignoreMultiSpeakers` | `False` | Skip frames with multiple speakers in visualization |
| `--jpegQscale` | `2` | JPEG quality scale for extracted frames (1-31, lower=better) |
| `--metadataOnly` | `False` | Skip video visualization, only produce JSON metadata |
| `--useBatched` | `False` | Use batched TalkNet inference (2-3x faster) |
| `--talknetBatchSize` | `16` | Batch size for TalkNet when using `--useBatched` |

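
How `--speakerThresh` and `--minSpeechLen` interact is not spelled out in the option descriptions. One plausible reading, sketched here as an assumption rather than the pipeline's actual logic, is that a frame counts as speaking when its score exceeds the threshold, and that runs shorter than the minimum duration are discarded. The function name, the 25 fps default and the per-frame score list are all hypothetical.

```python
def speaking_segments(scores, fps=25.0, speaker_thresh=0.6, min_speech_len=0.25):
    """Return (start_sec, end_sec) runs where scores stay above the threshold.

    Assumed semantics for illustration; not taken from the project's code.
    """
    segments = []
    start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score > speaker_thresh:
            if start is None:
                start = i  # a new speaking run begins at this frame
        elif start is not None:
            duration = (i - start) / fps
            if duration >= min_speech_len:
                segments.append((start / fps, i / fps))
            start = None
    return segments
```

Under this reading, raising `--speakerThresh` shortens or removes segments, while raising `--minSpeechLen` only removes the short ones.
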
- Scene detection via PySceneDetect
- Face detection via YOLO (batched inference)
- Face tracking via IOU + interpolation
- Speech classification via TalkNet
- Visualization with speaking durations
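
The tracking step above relies on two generic building blocks: an overlap measure between face boxes in consecutive frames (IOU) and interpolation across frames where a detection is missing. The following is a minimal sketch of both under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, interpolation is linear, and the function names are illustrative rather than the project's own.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def interpolate_track(detections):
    """Fill missing frames in a track by linear interpolation.

    detections maps frame index -> (x1, y1, x2, y2) for frames where the
    face was detected; gaps between detected frames are filled in.
    """
    frames = sorted(detections)
    filled = {}
    for f0, f1 in zip(frames, frames[1:]):
        b0, b1 = detections[f0], detections[f1]
        for f in range(f0, f1):
            t = (f - f0) / (f1 - f0)  # 0.0 at f0, approaching 1.0 near f1
            filled[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
    if frames:
        # The last detected frame is copied through unchanged.
        filled[frames[-1]] = tuple(float(v) for v in detections[frames[-1]])
    return filled
```

A tracker built on these pieces would typically extend a track when the IOU between its last box and a new detection exceeds some threshold; that threshold is not stated in this document.
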
For maximum speed on a GPU with 24GB+ VRAM (e.g., L4, A100):

    uv run python main.py --videoName video --videoFolder workdir \
        --yoloBatchSize 128 \
        --useBatched \
        --talknetBatchSize 32 \
        --metadataOnly

| Tier | Speedup | Command Additions | Best For |
|---|---|---|---|
| Baseline | 1x | (none) | Debugging, small videos |
| Fast | 2-3x | `--useBatched` | General use |
| Faster | 4-6x | `--useBatched --yoloBatchSize 64` | Production |
| Maximum | 6-10x | `--useBatched --yoloBatchSize 128 --metadataOnly` | Batch processing |

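
When many videos are processed in a loop, it can help to encode the tiers once and build the command line from them. The sketch below mirrors the flag combinations listed for each tier and uses only flags that appear in this document; the `TIER_FLAGS` mapping and `build_command` helper are hypothetical conveniences, not part of the project.

```python
# Flag sets copied from the tier table; tier names are lowercased here.
TIER_FLAGS = {
    "baseline": [],
    "fast": ["--useBatched"],
    "faster": ["--useBatched", "--yoloBatchSize", "64"],
    "maximum": ["--useBatched", "--yoloBatchSize", "128", "--metadataOnly"],
}


def build_command(video_name, video_folder, tier="baseline"):
    """Return the argv list for one pipeline run at the given tier."""
    if tier not in TIER_FLAGS:
        raise ValueError(f"unknown tier: {tier!r}")
    base = [
        "uv", "run", "python", "main.py",
        "--videoName", video_name,
        "--videoFolder", video_folder,
    ]
    return base + TIER_FLAGS[tier]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, one call per video.
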
Processes multiple face tracks in parallel instead of sequentially. Recommended for all use cases.

    uv run python main.py --videoName video --videoFolder workdir --useBatched

Higher batch sizes improve GPU utilization for face detection:
- 16GB VRAM: `--yoloBatchSize 64`
- 24GB VRAM: `--yoloBatchSize 128`
- 40GB+ VRAM: `--yoloBatchSize 200`
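
The VRAM guidance above can be written as a small lookup. This is a sketch only: the function name is hypothetical, and the fallback to 32 below 16GB is an assumption based on the documented default for `--yoloBatchSize`.

```python
def suggest_yolo_batch_size(vram_gb):
    """Map available VRAM (in GB) to a --yoloBatchSize value."""
    if vram_gb >= 40:
        return 200
    if vram_gb >= 24:
        return 128
    if vram_gb >= 16:
        return 64
    # Assumption: fall back to the documented default on smaller GPUs.
    return 32
```

The VRAM figure can be supplied by hand or read with PyTorch via `torch.cuda.get_device_properties(0).total_memory` (in bytes). Reported totals are often slightly below the nominal card size, so round to the nearest GB before comparing.
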
The pipeline uses PyAV for fast video decoding (faster than OpenCV).
Skip expensive video visualization when only JSON output is needed:

    uv run python main.py --videoName video --videoFolder workdir --metadataOnly

Choose between speed and accuracy:
- `n` (nano): Fastest, good for clear faces
- `s` (small): Balanced
- `m` (medium): Better accuracy
- `l` (large): Best accuracy, slower
- GPU utilization low during TalkNet: Use `--useBatched` flag
- Out of memory: Reduce `--yoloBatchSize` and `--talknetBatchSize`
- CPU bottleneck: Increase `--nDataLoaderThread` (up to CPU core count)
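
For the CPU bottleneck case, the upper bound for `--nDataLoaderThread` can be read from the machine instead of guessed. A minimal sketch, assuming the logical core count reported by the OS is the right ceiling; the helper name is hypothetical.

```python
import os


def clamp_loader_threads(requested, cpu_count=None):
    """Clamp a requested thread count to the range [1, CPU core count]."""
    # os.cpu_count() may return None on some platforms; fall back to 1.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(requested, cores))
```

The result can be passed straight to `--nDataLoaderThread`, for example `clamp_loader_threads(16)` on a machine with fewer than 16 cores.
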
This project builds on the great work from:
- TalkNet-ASD for active speaker detection.
- YOLO-Face for face detection.