Thank you for your great work and for sharing the code!
I have been trying to reproduce the results reported in your paper. Since our generated videos are longer than 5 seconds, I ran the evaluation with VBench-Long, using the provided rewritten prompts and randomly sampled seeds.
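For reference, my evaluation call looked roughly like the sketch below. The paths and run name are placeholders, and the dimension key and long-video mode string follow my local VBench install, so they may well differ from the setup you used for the paper:

```python
# Rough sketch of my reproduction run for the Multiple Objects dimension.
# Paths and the run name are placeholders; "multiple_objects" and the
# long-video mode string come from my local VBench checkout and are
# assumptions about your setup, not taken from the paper.
import torch
from vbench import VBench

device = torch.device("cuda")

vbench = VBench(
    device,
    "VBench_full_info.json",   # prompt/annotation file shipped with VBench
    "evaluation_results/",     # output directory for per-dimension scores
)

vbench.evaluate(
    videos_path="generated_videos/",        # our >5 s clips generated from the rewritten prompts
    name="multiple_objects_reproduction",
    dimension_list=["multiple_objects"],
    mode="long_vbench_standard",            # assumption: standard VBench-Long mode
)
```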
While most results are reasonable, I noticed significant discrepancies in several dimensions. For example, on the Multiple Objects dimension:
Reported in Paper: 78.66
Our Reproduction: 87.86 (approx. 9.2 points higher)
I am concerned that this gap might stem from differences in the evaluation setup rather than from the model itself. Could you please clarify a few details about your evaluation environment, e.g., the VBench version or commit, the number of videos sampled per prompt, and whether fixed seeds were used?
If any specific evaluation scripts were used to produce the paper results, it would be extremely helpful if you could share them.
Thank you in advance for your help!