[📄 Paper Link] [🤗 VerIPO-7B-v1.0]
| Training Loop Formation | Supervision Type | Speed | Exploration/Path Characteristics |
|---|---|---|---|
| Supervised fine-tuning (SFT) | Cross-entropy (single reference) | Fastest | Single path |
| Direct Preference Optimization (DPO) | Preference pair (two responses) | Fast | Pair-wise alignment |
| Outcome-based group optimization (GRPO) | Sampled rollouts (N per query) | Slower | Broad exploration |
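For intuition, the sketch below contrasts the three supervision signals at the loss level. It is a minimal, self-contained illustration (not this repository's training code): `sft_loss`, `dpo_loss`, and `grpo_advantages` follow the standard textbook formulations, and all tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    """SFT: cross-entropy against a single reference path.
    logits: (seq_len, vocab), target_ids: (seq_len,)."""
    return F.cross_entropy(logits, target_ids)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: pairwise preference loss over one chosen / one rejected response.
    Inputs are summed log-probabilities of full responses (policy and frozen reference)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

def grpo_advantages(rewards):
    """GRPO: group-relative advantages over N sampled rollouts of the same prompt,
    used to weight the policy-gradient update."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 8 rollouts of one query, three of which earn the outcome reward.
print(grpo_advantages(torch.tensor([1., 0., 0., 1., 0., 1., 0., 0.])))
```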
2025.06.06 🚀 We release the checkpoint and evaluation code (Sec. 4) of VerIPO-7B-v1.0. You can download the checkpoint from 🤗 Hugging Face.
Popular Reinforcement Fine-Tuning (RFT) methods, e.g., Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in both the quality of long chains-of-thought (CoTs) and downstream performance.
To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning-chain quality, especially in length and contextual consistency. The training loop thus benefits from GRPO's expansive search (exploration) and DPO's targeted optimization.
Experimental results demonstrate: 1) significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) our trained models surpass large-scale instruction-tuned Video-LLMs that answer directly, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
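To make the Rollout-Aware Verifier concrete, the sketch below shows one way GRPO rollouts for a single query could be turned into a DPO preference pair: a small LLM judge scores each rollout's reasoning, and the best correct rollout and worst incorrect rollout form the (chosen, rejected) pair. The `judge_rollout` callable and the selection heuristic are assumptions for illustration, not the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Rollout:
    prompt: str
    cot: str          # the model's reasoning chain ("thinking")
    answer: str       # the final extracted answer
    is_correct: bool  # verifiable outcome reward from the GRPO stage

def build_preference_pair(
    rollouts: List[Rollout],
    judge_rollout: Callable[[Rollout], float],  # hypothetical small-LLM judge returning a quality score
) -> Optional[Tuple[Rollout, Rollout]]:
    """Turn one query's GRPO rollouts into a (chosen, rejected) pair for DPO.

    Illustrative heuristic: chosen = best-judged correct rollout,
    rejected = worst-judged incorrect rollout.
    """
    correct = [r for r in rollouts if r.is_correct]
    incorrect = [r for r in rollouts if not r.is_correct]
    if not correct or not incorrect:
        return None  # no contrastive signal for this query
    chosen = max(correct, key=judge_rollout)
    rejected = min(incorrect, key=judge_rollout)
    return chosen, rejected
```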
Figure 1: Experimental Findings. Figures (A, D): Initial GRPO training with different data types shows only utilizing Video-QA data decreases response length. Figures (B, E): Continual GRPO training with/without Verifier-guided DPO (VerIPO) demonstrates VerIPO improves accuracy and response length. Figure (C): Inconsistency rate (thinking vs. final answer) at different stages reveals our method lowers contextual inconsistency of long CoTs while GRPO increases it. Figure (F): Performance on challenging video reasoning dataset VSI-Bench [81] shows VerIPO (trained with Qwen2.5-VL-7B) outperforms strong LMMs including GPT-4o [23], Video-R1 [18], and Kimi-VL [61].
Figure 2: Overview of VerIPO workflow. This training loop is guided by the Verifier's continuous evaluation and selection of training samples. The optimization process progressively improves the model's long-term reasoning capability by learning from high-quality and informative reasoning examples.
The training loop follows a curriculum learning approach to gradually activate the LMM's long-term video reasoning ability. It begins with simple-modality data (text-only or image QA) for initial reasoning activation with GRPO, followed by GRPO training on image and video QA data, as shown in the following table.
Table 1: Training data and hyperparameters across different stages.
| Stage | Reasoning Activation | Group-Slow-Search | Pair-Fast-Align | Group-Slow-Search |
|---|---|---|---|---|
| Algorithm | GRPO | GRPO | DPO | GRPO |
| Data | Math-Text (30K), Reasoning-Image (39K), Long Document (1K) | Science-Image (4K), Spatial-Image (9K), General-Image (10K), VQA-Video (24K) | Rollouts of VQA-Video | VQA-Video |
| Global Batch Size | 128 | 64 | 32 | 64 |
| Rollout Batch Size | 64 | 64 | - | 64 |
| Learning Rate | 1e-6 | 1e-6 | 5e-7 | 5e-7 |
| Rollout Responses Per Query | 8 | 8 | - | 8 |
| Sampling Temperature | 1.0 | 1.0 | - | 1.0 |
| DPO Beta (β) | - | - | 0.1 | - |
Then, the whole GRPO-Verifier-DPO pipeline continuously enhances the model's long-term reasoning capability and gradually stabilizes its performance on video reasoning, iteratively pushing towards the model's inherent reasoning limit. During the iterative process, we gradually discard 80% of the simple examples.
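The outer loop can be pictured as in the schematic below. This is a sketch only: `run_grpo`, `run_verifier`, and `run_dpo` are placeholder callables standing in for the actual training stages, and the "simple example" criterion used here (queries whose rollouts are already all correct) is an assumption, since the exact rule is not spelled out in this section.

```python
import random
from typing import Callable, List

def veripo_iteration(
    model,
    dataset: List[dict],          # each item is assumed to carry an "id" key
    run_grpo: Callable,           # GRPO stage: returns (updated model, rollouts per query)
    run_verifier: Callable,       # verifier: returns (preference pairs, ids of fully solved queries)
    run_dpo: Callable,            # DPO stage on the curated pairs
    discard_ratio: float = 0.8,   # fraction of "simple" queries dropped between iterations
):
    # 1) Group-Slow-Search: broad exploration with N rollouts per query.
    model, rollouts = run_grpo(model, dataset)
    # 2) Rollout-Aware Verifier: curate contrastive, contextually consistent CoT pairs.
    pairs, solved_ids = run_verifier(rollouts)
    # 3) Pair-Fast-Align: fast, targeted optimization with DPO.
    model = run_dpo(model, pairs)
    # Drop most already-solved ("simple") queries so the next GRPO round
    # focuses compute on harder examples (criterion is an assumption).
    dropped = set(random.sample(list(solved_ids), int(discard_ratio * len(solved_ids))))
    remaining = [q for q in dataset if q["id"] not in dropped]
    return model, remaining
```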
VerIPO is built on Qwen2.5-VL. Before running inference or evaluation, create the environment with the following commands.

```bash
git clone https://github.com/HITsz-TMG/VerIPO
conda create -n veripo python=3.10
conda activate veripo
pip install -r requirements.txt
pip install "qwen-vl-utils[decord]"
pip install flash_attn --no-build-isolation
```

Inference example:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Uni-MoE/VerIPO-7B-v1.0",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
# default processor
processor = AutoProcessor.from_pretrained("Uni-MoE/VerIPO-7B-v1.0")
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 128*28*28,
"max_frames": 128,
"fps": 2.0
},
{"type": "text", "text": "Describe this video."},
],
}
]
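# Build the chat-template prompt and extract the sampled video frames and fps kwargs for the processor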
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
**video_kwargs,
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, temperature=1e-6, repetition_penalty=1.05)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

VerIPO is evaluated on VSI-Bench, VideoMMMU, MMVU, TOMATO, LVBench, and Video-MME.
We provide the JSON files; you need to download the videos from each benchmark's official website.
```bash
# Example: Evaluation on VSI-Bench
python evaluation/evaluation_vllm.py --model_path Uni-MoE/VerIPO-7B-v1.0 --output_path your_save_json_path --prompt_path evaluation/json/vsi_bench.json --video_dir path_to_video

# Example: Evaluation on TOMATO (uses fps=4.0)
python evaluation/evaluation_vllm.py --model_path Uni-MoE/VerIPO-7B-v1.0 --output_path your_save_json_path --prompt_path evaluation/json/tomato.json --video_dir path_to_video --video_fps 4.0
```

We also provide a script for calculating scores.
```bash
python evaluation/calculate_score.py --pred_path your_save_json_path
```

We acknowledge the outstanding open-source contributions from OpenRLHF, MM-EUREKA, and vLLM. We also extend our gratitude to DeepSeek-R1 and QwenVL for their open-source techniques and base models, which have enabled us to further our exploration.
```bibtex
@article{li2025veripo,
  title={VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization},
  author={Li, Yunxin and Chen, Xinyu and Li, Zitao and Liu, Zhenyu and Wang, Longyue and Luo, Wenhan and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2505.19000},
  year={2025}
}
```