Hi, thank you for sharing this impressive work and for releasing the code!
I have some questions about the training data format and preprocessing pipeline, and I hope you can provide some clarification or guidance.
- Training sample length and preprocessing
I noticed that in example/trainset_example/fps25_mp4, the provided training samples are relatively short, around 5–8 seconds each.
Were these clips preprocessed and segmented from longer raw videos? (The paper reports an average video length of around 150 seconds.)
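For context, this is roughly how I segment longer raw videos at the moment (a minimal sketch; the 6 s clip length is my own guess based on the 5–8 s examples, and the 25 fps target is taken from the folder name):

```python
import subprocess
from pathlib import Path

CLIP_LEN = 6.0   # seconds; guessed from the 5-8 s example clips
FPS = 25         # matches the fps25_mp4 folder name

def video_duration(path: Path) -> float:
    """Query clip duration in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def segment(src: Path, dst_dir: Path) -> None:
    """Cut a long raw video into fixed-length clips, re-encoded at 25 fps."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    duration = video_duration(src)
    start, idx = 0.0, 0
    while start + CLIP_LEN <= duration:
        dst = dst_dir / f"{src.stem}_{idx:04d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", f"{start:.2f}", "-t", f"{CLIP_LEN:.2f}",
             "-i", str(src), "-r", str(FPS),
             "-c:v", "libx264", "-c:a", "aac", str(dst)],
            check=True,
        )
        start += CLIP_LEN
        idx += 1
```

Is a fixed-window cut like this what you do, or do you segment on scene/speaker boundaries?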
- Data filtering and quality control
Would it be possible to share the data filtering scripts, or at least provide more detailed recommendations? In particular:
Face size / scale:
You mentioned that overly large faces may negatively affect lip-sync accuracy (#73). Is there a recommended face size range (e.g., relative to image resolution)? Do you filter samples based on face bounding box size?
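To make this concrete, the kind of size check I have in mind looks like the sketch below; it uses OpenCV's stock Haar cascade purely for illustration, and the 0.2–0.6 height-ratio band is a placeholder rather than a value from the paper:

```python
import cv2

# Stock OpenCV detector, used here only for illustration.
_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_height_ratio(frame) -> float | None:
    """Largest detected face height relative to frame height (None if no face)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    _, _, _, h = max(faces, key=lambda b: b[2] * b[3])
    return h / frame.shape[0]

def keep_by_face_size(ratio: float, lo: float = 0.2, hi: float = 0.6) -> bool:
    """Placeholder band: reject faces that are too small or fill the frame."""
    return lo <= ratio <= hi
```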
Head motion:
Is there any filtering based on head pose variation (pitch / yaw / roll)? Do you remove samples with excessive head movement or large pose changes?
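Again for illustration, here is the sort of pose-variation filter I am imagining; estimate_pose is a hypothetical helper (e.g. solvePnP on facial landmarks, not part of the released code), and the degree thresholds are guesses on my part:

```python
import numpy as np

def estimate_pose(frame) -> tuple[float, float, float]:
    """Hypothetical estimator returning (pitch, yaw, roll) in degrees,
    e.g. via solvePnP on facial landmarks."""
    raise NotImplementedError

def pose_range(frames) -> np.ndarray:
    """Max-min span of pitch/yaw/roll across a clip's frames."""
    poses = np.array([estimate_pose(f) for f in frames])
    return poses.max(axis=0) - poses.min(axis=0)

# Placeholder thresholds (degrees) for pitch, yaw, roll.
MAX_RANGE = np.array([20.0, 30.0, 15.0])

def keep_by_head_motion(frames) -> bool:
    """Drop clips whose head pose varies beyond the placeholder limits."""
    return bool((pose_range(frames) <= MAX_RANGE).all())
```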
Other filtering criteria:
Do you apply any filtering based on audio–visual sync confidence, face detection stability, image quality, occlusion, or blur?
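For blur specifically, I currently use the common variance-of-Laplacian heuristic sketched below; the threshold of 100 is a placeholder, and I would be interested to know whether you use something similar (or, for audio–visual sync, a SyncNet-style confidence score):

```python
import statistics
import cv2

def blur_score(frame) -> float:
    """Variance of the Laplacian: lower values indicate blurrier frames."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def keep_by_sharpness(frames, threshold: float = 100.0) -> bool:
    """Placeholder: drop the clip if its median frame is too blurry."""
    return statistics.median(blur_score(f) for f in frames) >= threshold
```

Any thresholds or criteria you can share for these filters would be very helpful. Thanks in advance!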