
Questions about training data format and preprocessing / filtering #78


Description

@undobug

Hi, thank you for sharing this impressive work and for releasing the code!

I have some questions about the training data format and preprocessing pipeline, and I hope you can provide some clarification or guidance.

  1. Training sample length and preprocessing

I noticed that the training samples provided in `example/trainset_example/fps25_mp4` are relatively short, around 5–8 seconds each.
Are these clips already preprocessed and segmented from longer raw videos? (The paper reports an average video length of around 150 seconds.)
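For context, this is roughly how I'm segmenting raw videos on my side (the 6-second target, 25 fps resampling, and codec settings are my own guesses, not anything from your pipeline). Is this close to what you do?

```python
# Hypothetical segmentation sketch: split a long raw video into ~6 s clips
# at 25 fps. Clip length and encoding choices are placeholders.
import subprocess

def split_video(src: str, out_pattern: str, clip_len: float = 6.0) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-r", "25",                       # resample to 25 fps
            "-c:v", "libx264", "-c:a", "aac",
            # force a keyframe at each clip boundary so segments are exact
            "-force_key_frames", f"expr:gte(t,n_forced*{clip_len})",
            "-f", "segment",
            "-segment_time", str(clip_len),   # target clip length in seconds
            "-reset_timestamps", "1",
            out_pattern,                      # e.g. "clips/clip_%04d.mp4"
        ],
        check=True,
    )

split_video("raw/video_001.mp4", "clips/clip_%04d.mp4")
```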

  2. Data filtering and quality control

Would it be possible to share data filtering scripts, or at least provide more detailed recommendations? In particular:

Face size / scale:
You mentioned that overly large faces may negatively affect lip-sync accuracy (#73). Is there a recommended face size range (e.g., relative to image resolution)? Do you filter samples based on face bounding box size?
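To illustrate what I mean, here is the kind of relative-size filter I have in mind; the MediaPipe detector and the 10–50% face-height band are placeholder choices on my side, not values from the paper:

```python
# Hypothetical face-size filter: keep a clip only if the detected face box
# stays within an assumed relative-size band (10-50% of frame height).
import cv2
import mediapipe as mp

MIN_REL, MAX_REL = 0.10, 0.50  # assumed face-height / frame-height band

def face_size_ok(video_path: str, sample_every: int = 5) -> bool:
    detector = mp.solutions.face_detection.FaceDetection(model_selection=1)
    cap = cv2.VideoCapture(video_path)
    idx, ratios = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            res = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if res.detections:
                # relative_bounding_box height is already frame-normalized
                h = res.detections[0].location_data.relative_bounding_box.height
                ratios.append(h)
        idx += 1
    cap.release()
    return bool(ratios) and all(MIN_REL <= r <= MAX_REL for r in ratios)
```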

Head motion:
Is there any filtering based on head pose variation (pitch / yaw / roll)? Do you remove samples with excessive head movement or large pose changes?
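By pose-based filtering I mean something like the following: given per-frame Euler angles from any head-pose estimator, reject clips whose pose range is too large. The 30° bound is purely a guess on my part:

```python
# Hypothetical head-motion filter over per-frame (pitch, yaw, roll) angles.
# The per-axis range threshold is an assumption, not a confirmed value.
import numpy as np

MAX_RANGE_DEG = 30.0  # assumed maximum per-axis pose range within a clip

def head_motion_ok(angles: np.ndarray) -> bool:
    """angles: (num_frames, 3) array of (pitch, yaw, roll) in degrees."""
    per_axis_range = angles.max(axis=0) - angles.min(axis=0)
    return bool((per_axis_range <= MAX_RANGE_DEG).all())

# Example: a clip whose yaw swings by 45 degrees would be rejected.
poses = np.array([[0.0, -20.0, 2.0], [5.0, 25.0, -1.0], [3.0, 10.0, 0.0]])
print(head_motion_ok(poses))  # False
```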

Other filtering criteria:
Do you apply any filtering based on audio–visual sync confidence, face detection stability, image quality, occlusion, or blur?
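On the blur side, I'm currently using the standard variance-of-Laplacian heuristic sketched below (the 100.0 threshold is my own guess); for audio–visual sync I assume confidence scores from a model like SyncNet would be used, which I haven't sketched here. Does this match your setup?

```python
# Hypothetical blur check via the variance-of-Laplacian heuristic.
# The sharpness threshold and sampling stride are assumptions.
import cv2
import numpy as np

BLUR_THRESHOLD = 100.0  # assumed minimum Laplacian variance for "sharp"

def sharpness(frame_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def clip_is_sharp(video_path: str, sample_every: int = 10) -> bool:
    cap = cv2.VideoCapture(video_path)
    idx, scores = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            scores.append(sharpness(frame))
        idx += 1
    cap.release()
    return bool(scores) and min(scores) >= BLUR_THRESHOLD
```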
