
Active Speaker Detection

A full pipeline to detect and highlight active speakers in videos, using YOLO for face detection and TalkNet for audio-visual active speaker classification.


Clone the Repository

git clone https://github.com/MjdMahasneh/active-speaker-detection.git
cd active-speaker-detection

Setup

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Install Dependencies

uv sync

This will create a virtual environment and install all dependencies, including PyTorch with CUDA support.
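
To verify that the CUDA-enabled PyTorch build works on your machine:

uv run python -c "import torch; print(torch.cuda.is_available())"

This should print True on a system with a working GPU setup.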


Run

To run the pipeline, edit the configuration in ./config/args.py and then run main.py. Alternatively, you can run the script directly with command-line arguments (an illustrative sketch of the flag definitions follows the examples below).

uv run python main.py --videoName video --videoFolder workdir

For better performance on a GPU, increase the batch size and the number of data loader threads:

uv run python main.py --videoName video --videoFolder workdir --yoloBatchSize 64 --nDataLoaderThread 16

For faster processing without video visualization (metadata only):

uv run python main.py --videoName video --videoFolder workdir --metadataOnly

To use a larger YOLO model for better accuracy:

uv run python main.py --videoName video --videoFolder workdir --yoloVariant m

Note: the input video can be in .mp4 or .avi format.
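
For reference, the documented flags and defaults could be declared with argparse roughly as follows. This is an illustrative sketch only; ./config/args.py remains the authoritative source:

import argparse

# Illustrative sketch of a subset of the documented flags, with their defaults.
parser = argparse.ArgumentParser(description="Active speaker detection pipeline")
parser.add_argument("--videoName", default="video", help="Input video name (without extension)")
parser.add_argument("--videoFolder", default="workdir", help="Path for inputs and outputs")
parser.add_argument("--yoloVariant", default="n", choices=["n", "s", "m", "l", "x"])
parser.add_argument("--yoloBatchSize", type=int, default=32)
parser.add_argument("--speakerThresh", type=float, default=0.6)
args = parser.parse_args()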


Output Structure

workdir/
└── video/
    ├── pyavi/                 # extracted audio + output video
    ├── pyframes/              # all video frames (JPEG format)
    ├── pycrop/                # cropped face clips
    └── pywork/
        ├── tracks.pckl        # face tracks
        ├── scores.pckl        # speaking scores
        ├── speaker_summary.json    # summary of speaker activity
        └── frame_metadata.json     # frame-centric metadata
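
The JSON files can be read directly, and the .pckl files are standard Python pickles. A minimal loading sketch (the internal structure of the pickled objects is specific to this pipeline):

import json
import pickle

base = "workdir/video/pywork"  # follows the layout above for --videoName video

with open(f"{base}/tracks.pckl", "rb") as f:
    tracks = pickle.load(f)    # face tracks
with open(f"{base}/scores.pckl", "rb") as f:
    scores = pickle.load(f)    # per-track speaking scores
with open(f"{base}/speaker_summary.json") as f:
    summary = json.load(f)     # summary of speaker activity

print(type(tracks), type(scores), summary)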

Models

  • YOLOv11-Face: Face detection (variants: n/s/m/l/x for speed vs accuracy tradeoff)
  • TalkNet: Audio-visual active speaker detection

Command Line Options

Option                  Default   Description
--videoName             video     Input video name (without extension)
--videoFolder           workdir   Path for inputs and outputs
--yoloVariant           n         YOLO variant: n (nano), s (small), m (medium), l (large), x (extra-large)
--yoloBatchSize         32        Batch size for face detection (increase for better GPU utilization)
--nDataLoaderThread     10        Number of parallel workers for video cropping
--speakerThresh         0.6       Speaker detection confidence threshold
--minSpeechLen          0.25      Minimum speech duration (seconds) to count as speaking
--ignoreMultiSpeakers   False     Skip frames with multiple speakers in visualization
--jpegQscale            2         JPEG quality scale for extracted frames (1-31, lower = better)
--metadataOnly          False     Skip video visualization, only produce JSON metadata
--useBatched            False     Use batched TalkNet inference (2-3x faster)
--talknetBatchSize      16        Batch size for TalkNet when using --useBatched

Components

  • Scene detection via PySceneDetect
  • Face detection via YOLO (batched inference)
  • Face tracking via IOU + interpolation (see the sketch after this list)
  • Speech classification via TalkNet
  • Visualization with speaking durations
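
As a generic illustration of the IOU matching idea (not the repository's actual implementation): a detection is linked to an existing track when it overlaps the track's last box strongly enough, and gaps between matches are filled by interpolation.

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical linking threshold; the pipeline's actual value may differ.
IOU_THRESH = 0.5
same_track = iou((10, 10, 60, 60), (15, 12, 62, 58)) > IOU_THRESH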

Performance Optimization Guide

Quick Start for Best Performance

For maximum speed on a GPU with 24GB+ VRAM (e.g., L4, A100):

uv run python main.py --videoName video --videoFolder workdir \
    --yoloBatchSize 128 \
    --useBatched \
    --talknetBatchSize 32 \
    --metadataOnly

Optimization Tiers

Tier       Speedup   Command Additions                                 Best For
Baseline   1x        (none)                                            Debugging, small videos
Fast       2-3x      --useBatched                                      General use
Faster     4-6x      --useBatched --yoloBatchSize 64                   Production
Maximum    6-10x     --useBatched --yoloBatchSize 128 --metadataOnly   Batch processing

Key Optimizations

1. Batched TalkNet Inference (--useBatched)

Processes multiple face tracks in parallel instead of sequentially. Recommended for all use cases.

uv run python main.py --videoName video --videoFolder workdir --useBatched
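
As a generic illustration of why batching helps (using a stand-in model, not TalkNet's actual API): stacking several inputs into one tensor lets the GPU run a single forward pass instead of one call per track.

import torch

model = torch.nn.Linear(512, 2).eval()          # stand-in for the real model
clips = [torch.randn(512) for _ in range(16)]   # stand-in features, one per face track

with torch.no_grad():
    batch = torch.stack(clips)   # shape (16, 512): one batched forward pass
    scores = model(batch)        # vs. [model(c) for c in clips], run sequentially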

2. Increase YOLO Batch Size (--yoloBatchSize)

Higher batch sizes improve GPU utilization for face detection:

  • 16GB VRAM: --yoloBatchSize 64
  • 24GB VRAM: --yoloBatchSize 128
  • 40GB+ VRAM: --yoloBatchSize 200

3. Fast Video Loading

The pipeline uses PyAV for fast video decoding (faster than OpenCV).
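
A minimal PyAV decode loop, for illustration only (the pipeline's internals may differ; the input path assumes --videoName video --videoFolder workdir):

import av

container = av.open("workdir/video.mp4")
for frame in container.decode(video=0):
    img = frame.to_ndarray(format="bgr24")  # HxWx3 numpy array in OpenCV-style BGR
    # ... run face detection on img ...
container.close()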

4. Metadata-Only Mode (--metadataOnly)

Skip expensive video visualization when only JSON output is needed:

uv run python main.py --videoName video --videoFolder workdir --metadataOnly

5. YOLO Variant Selection (--yoloVariant)

Choose between speed and accuracy:

  • n (nano): Fastest, good for clear faces
  • s (small): Balanced
  • m (medium): Better accuracy
  • l (large): Higher accuracy, slower
  • x (extra-large): Best accuracy, slowest

Troubleshooting Performance

  1. GPU utilization low during TalkNet: Use the --useBatched flag
  2. Out of memory: Reduce --yoloBatchSize and --talknetBatchSize
  3. CPU bottleneck: Increase --nDataLoaderThread (up to the CPU core count; see the check below)
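
Two quick environment checks can help size these flags; this sketch uses standard os and torch calls:

import os
import torch

print("CPU cores:", os.cpu_count())  # upper bound for --nDataLoaderThread
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU VRAM: {vram_gb:.1f} GB")  # guides --yoloBatchSize (see tiers above)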

Acknowledgements

This project builds on the great work from TalkNet, YOLOv11-Face, PySceneDetect, and PyAV.
