This is the official implementation of the paper VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT.
VTG-GPT leverages frozen GPTs to enable zero-shot inference without training.
- Install dependencies

```shell
conda create -n vtg-gpt python=3.10
conda activate vtg-gpt
pip install -r requirements.txt
```

- Unzip caption files

```shell
cd data/qvhighlights/caption/
unzip val.zip
```

- Inference and evaluation

```shell
# inference
python infer_qvhighlights.py val

# evaluation
bash standalone_eval/eval.sh
```

Run the above commands to get:
| Metrics | R1@0.5 | R1@0.7 | mAP@0.5 | mAP@0.75 | mAP@avg |
|---|---|---|---|---|---|
| Values | 59.03 | 38.90 | 56.11 | 35.44 | 35.57 |
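The R1@m metrics above count a prediction as correct when the top-1 predicted moment overlaps the ground-truth span with temporal IoU of at least m (e.g. R1@0.5 uses a 0.5 threshold). As a minimal sketch of that overlap measure (the function name and example spans below are illustrative, not taken from this repo's evaluation code):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two [start, end] spans given in seconds."""
    # Overlap length, clamped at zero for disjoint spans.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both span lengths minus their overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted moment [10, 30] vs. ground truth [20, 40]:
# intersection = 10, union = 30, IoU ≈ 0.33 -> misses the R1@0.5 threshold
```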
- Generate captions with MiniGPT (the `conda activate minigptv` step is implied by the environment created above)

```shell
cd minigpt
conda create --name minigptv python=3.9
conda activate minigptv
pip install -r requirements.txt
python run_v2.py
```

- Rephrase queries with Baichuan2

```shell
cd Baichuan2
conda activate vtg-gpt
python rephrase_query.py
```

We thank Youyao Jia for helpful discussions.
This code is based on Moment-DETR and SeViLA, and uses resources from MiniGPT-4, Baichuan2, and LLaMa2. We thank the authors for their awesome open-source contributions.
If you find this project useful for your research, please cite our papers:
```bibtex
@inproceedings{xu2025zero,
  title={Zero-shot video moment retrieval via off-the-shelf multimodal large language models},
  author={Xu, Yifang and Sun, Yunzhuo and Zhai, Benxiang and Li, Ming and Liang, Wenxin and Li, Yang and Du, Sidan},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={9},
  pages={8978--8986},
  year={2025}
}

@article{xu2024vtg,
  title={VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT},
  author={Xu, Yifang and Sun, Yunzhuo and Xie, Zien and Zhai, Benxiang and Du, Sidan},
  journal={Applied Sciences},
  volume={14},
  number={5},
  pages={1894},
  year={2024},
  publisher={MDPI}
}
```
