Hi, thank you for sharing this impressive work and for releasing the code!
I have some questions about the training data format and preprocessing pipeline, and I hope you can provide some clarification or guidance.
- Training sample length and preprocessing
I noticed that in example/trainset_example/fps25_mp4, the provided training samples are relatively short, around 5–8 seconds each.
Were these clips preprocessed and segmented from longer raw videos? (The paper reports an average video length of around 150 seconds.)
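For context, this is roughly how I segment longer raw videos at the moment (a minimal sketch; the 6 s clip length is my own guess based on the 5–8 s examples, and the 25 fps target is taken from the folder name):

```python
import subprocess
from pathlib import Path

CLIP_LEN = 6.0   # seconds; guessed from the 5-8 s example clips
FPS = 25         # matches the fps25_mp4 folder name

def video_duration(path: Path) -> float:
    """Query clip duration in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def segment(src: Path, dst_dir: Path) -> None:
    """Cut a long raw video into fixed-length clips, re-encoded at 25 fps."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    duration = video_duration(src)
    start, idx = 0.0, 0
    while start + CLIP_LEN <= duration:
        dst = dst_dir / f"{src.stem}_{idx:04d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", f"{start:.2f}", "-t", f"{CLIP_LEN:.2f}",
             "-i", str(src), "-r", str(FPS),
             "-c:v", "libx264", "-c:a", "aac", str(dst)],
            check=True,
        )
        start += CLIP_LEN
        idx += 1
```

Is a fixed-window cut like this what you do, or do you segment on scene/speaker boundaries?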
- Data filtering and quality control
Would it be possible to share the data filtering scripts, or at least provide more detailed recommendations? In particular:
Face size / scale:
You mentioned that overly large faces may negatively affect lip-sync accuracy (#73). Is there a recommended face size range (e.g., relative to image resolution)? Do you filter samples based on face bounding box size?
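To make this concrete, the kind of size check I have in mind looks like the sketch below; it uses OpenCV's stock Haar cascade purely for illustration, and the 0.2–0.6 height-ratio band is a placeholder rather than a value from the paper:

```python
import cv2

# Stock OpenCV detector, used here only for illustration.
_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_height_ratio(frame) -> float | None:
    """Largest detected face height relative to frame height (None if no face)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    _, _, _, h = max(faces, key=lambda b: b[2] * b[3])
    return h / frame.shape[0]

def keep_by_face_size(ratio: float, lo: float = 0.2, hi: float = 0.6) -> bool:
    """Placeholder band: reject faces that are too small or fill the frame."""
    return lo <= ratio <= hi
```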
Head motion:
Is there any filtering based on head pose variation (pitch / yaw / roll)? Do you remove samples with excessive head movement or large pose changes?
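Again for illustration, here is the sort of pose-variation filter I am imagining; estimate_pose is a hypothetical helper (e.g. solvePnP on facial landmarks, not part of the released code), and the degree thresholds are guesses on my part:

```python
import numpy as np

def estimate_pose(frame) -> tuple[float, float, float]:
    """Hypothetical estimator returning (pitch, yaw, roll) in degrees,
    e.g. via solvePnP on facial landmarks."""
    raise NotImplementedError

def pose_range(frames) -> np.ndarray:
    """Max-min span of pitch/yaw/roll across a clip's frames."""
    poses = np.array([estimate_pose(f) for f in frames])
    return poses.max(axis=0) - poses.min(axis=0)

# Placeholder thresholds (degrees) for pitch, yaw, roll.
MAX_RANGE = np.array([20.0, 30.0, 15.0])

def keep_by_head_motion(frames) -> bool:
    """Drop clips whose head pose varies beyond the placeholder limits."""
    return bool((pose_range(frames) <= MAX_RANGE).all())
```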
Other filtering criteria:
Do you apply any filtering based on audio–visual sync confidence, face detection stability, image quality, occlusion, or blur?
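For blur specifically, I currently use the common variance-of-Laplacian heuristic sketched below; the threshold of 100 is a placeholder, and I would be interested to know whether you use something similar (or, for audio–visual sync, a SyncNet-style confidence score):

```python
import statistics
import cv2

def blur_score(frame) -> float:
    """Variance of the Laplacian: lower values indicate blurrier frames."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def keep_by_sharpness(frames, threshold: float = 100.0) -> bool:
    """Placeholder: drop the clip if its median frame is too blurry."""
    return statistics.median(blur_score(f) for f in frames) >= threshold
```

Any thresholds or criteria you can share for these filters would be very helpful. Thanks in advance!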