
Reproduced VBench scores are significantly different from paper results (e.g., Multiple Objects) #33

@Sungwoong-Yune

Thank you for your great work and for sharing the code!

I have been trying to reproduce the results presented in your paper. Since our generated videos are longer than 5 seconds, I used VBench-Long for the evaluation, with the provided rewritten prompts and randomly sampled seeds.
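For reference, this is roughly how I invoked the evaluation. It is only a minimal sketch using the standard VBench Python API to show my settings; I actually ran the VBench-Long variant, which follows the same pattern but has its own entry point, and all paths and the run name below are placeholders from my local setup, not from your repo:

```python
import torch
from vbench import VBench

device = torch.device("cuda")

# Placeholder paths for my local setup (not from your repo):
# the JSON is the prompt/dimension metadata shipped with VBench,
# the second path is where result files get written.
vbench_runner = VBench(device, "VBench_full_info.json", "./evaluation_results")

# Evaluate only the dimension in question; the other dimensions were run the same way.
vbench_runner.evaluate(
    videos_path="./generated_videos",   # folder of our generated clips (each > 5 s)
    name="multiple_objects_repro",      # placeholder run name
    dimension_list=["multiple_objects"],
)
```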

While most results are reasonable, I noticed a significant discrepancy in several dimensions. Specifically, for the Multiple Objects dimension:

Reported in Paper: 78.66

Our Reproduction: 87.86 (approx. 9.2 points higher)

I am concerned that this gap might stem from differences in the evaluation setup rather than the model performance itself. Could you please clarify a few details regarding your evaluation environment?

If there are any specific evaluation scripts used for the paper results, it would be extremely helpful if you could share them.

Thank you in advance for your help!
