Thank you for your great work and for sharing the code!
I have been trying to reproduce the results reported in your paper. Since our generated videos are longer than 5 seconds, I ran the evaluation with VBench-Long, using the provided rewritten prompts and randomly sampled seeds.
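For reference, my evaluation call looked roughly like the sketch below. The paths and run name are placeholders, and the dimension key and long-video mode string follow my local VBench install, so they may well differ from the setup you used for the paper:

```python
# Rough sketch of my reproduction run for the Multiple Objects dimension.
# Paths and the run name are placeholders; "multiple_objects" and the
# long-video mode string come from my local VBench checkout and are
# assumptions about your setup, not taken from the paper.
import torch
from vbench import VBench

device = torch.device("cuda")

vbench = VBench(
    device,
    "VBench_full_info.json",   # prompt/annotation file shipped with VBench
    "evaluation_results/",     # output directory for per-dimension scores
)

vbench.evaluate(
    videos_path="generated_videos/",        # our >5 s clips generated from the rewritten prompts
    name="multiple_objects_reproduction",
    dimension_list=["multiple_objects"],
    mode="long_vbench_standard",            # assumption: standard VBench-Long mode
)
```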
While most results are reasonable, I noticed significant discrepancies in several dimensions. For example, on the Multiple Objects dimension:
Reported in Paper: 78.66
Our Reproduction: 87.86 (approx. 9.2 points higher)
I am concerned that this gap might stem from differences in the evaluation setup rather than from the model itself. Could you please clarify a few details about your evaluation environment, e.g., the VBench version or commit, the number of videos sampled per prompt, and whether fixed seeds were used?
If any specific evaluation scripts were used to produce the paper results, it would be extremely helpful if you could share them.
Thank you in advance for your help!