
VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization

[📄 Paper Link] [🤗 VerIPO-7B-v1.0]

| Training Loop Formation | Supervision Type | Speed | Exploration / Path Characteristics |
|---|---|---|---|
| Strongly supervised SFT | Cross Entropy (Single) | Fastest | Single path |
| Directed optimization supervision (DPO) | Pair (Two) | Fast | Pair-wise |
| Outcome-based Group optimization (GRPO) | Sampling (N) | Slower | Broad exploration |

🔥 News

2025.06.06 🚀 We release the checkpoint and evaluation code (see Evaluation in Sec. 3) of VerIPO-7B-v1.0. You can download the checkpoint from 🤗 Hugging Face.

1. Overview

Popular Reinforcement Fine-Tuning (RFT) methods, e.g., Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and show unstable improvements in both the quality of long chains of thought (CoTs) and downstream performance.

To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form a GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning-chain quality, especially in length and contextual consistency. The training loop thus benefits from GRPO's expansive search (exploration) and DPO's targeted optimization.
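
To make the loop concrete, the sketch below shows one way verifier judgments over a group of rollouts could be turned into a DPO preference pair. It is illustrative only; the function and field names are hypothetical and are not taken from this repository's training code.

from typing import Dict, List

def build_dpo_pair(prompt: str,
                   rollouts: List[str],
                   verifier_scores: List[float],
                   answer_correct: List[bool]) -> List[Dict[str, str]]:
    """Contrast the best-judged rollout against the worst one for DPO.

    `verifier_scores` stand in for a small LLM judge rating the reasoning
    logic (reflection, consistency between the thinking trace and the final
    answer); `answer_correct` marks rollouts whose final answer matched the
    reference. Both inputs are assumed to be produced upstream by GRPO
    sampling plus a rollout-aware verifier.
    """
    ranked = sorted(zip(rollouts, verifier_scores, answer_correct),
                    key=lambda x: (x[2], x[1]), reverse=True)
    chosen, rejected = ranked[0][0], ranked[-1][0]
    if chosen == rejected:
        return []  # all rollouts are equivalent; nothing informative to contrast
    return [{"prompt": prompt, "chosen": chosen, "rejected": rejected}]

# Toy usage: four rollouts sampled for one video question
pairs = build_dpo_pair(
    prompt="How many chairs appear in the video?",
    rollouts=["<think>...</think> 3", "<think>...</think> 5",
              "<think>...</think> 3", "3"],
    verifier_scores=[0.9, 0.4, 0.7, 0.1],
    answer_correct=[True, False, True, True],
)
print(pairs)  # pairs the well-reasoned correct rollout against the incorrect one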

Experimental results demonstrate: 1) significantly faster and more effective optimization than standard GRPO variants, yielding superior performance; 2) our trained models outperform large-scale instruction-tuned Video-LLMs that answer directly, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) with a single iteration, our model outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.

Figure 1: Experimental Findings. Figures (A, D): Initial GRPO training with different data types shows that using only Video-QA data decreases response length. Figures (B, E): Continual GRPO training with/without Verifier-guided DPO (VerIPO) demonstrates that VerIPO improves accuracy and response length. Figure (C): The inconsistency rate (thinking vs. final answer) at different stages reveals that our method lowers the contextual inconsistency of long CoTs, while GRPO increases it. Figure (F): Performance on the challenging video reasoning dataset VSI-Bench [81] shows that VerIPO (trained with Qwen2.5-VL-7B) outperforms strong LMMs, including GPT-4o [23], Video-R1 [18], and Kimi-VL [61].

2. Approach

Figure 2: Overview of the VerIPO workflow. The training loop is guided by the Verifier's continuous evaluation and selection of training samples. The optimization process progressively improves the model's long-term reasoning capability by learning from high-quality and informative reasoning examples.

The training loop follows a curriculum learning approach to gradually activate the LMM's long-term reasoning ability on video. It begins with simpler-modality data (text-only and image QA) for initial reasoning activation with GRPO, followed by GRPO training on image and video QA data, as shown in the following table.

Table 1: Training data and hyperparameters across different stages.

| Stage | Reasoning Activation | Group-Slow-Search | Pair-Fast-Align | Group-Slow-Search |
|---|---|---|---|---|
| Algorithm | GRPO | GRPO | DPO | GRPO |
| Data | Long Document (1k)<br>Math-Text (30k)<br>Reasoning-Image (39k) | Science-Image (4k)<br>Spatial-Image (9k)<br>General-Image (10k)<br>VQA-Video (24k) | Rollouts of VQA-Video | VQA-Video |
| Global Batch Size | 128 | 64 | 32 | 64 |
| Rollout Batch Size | 64 | 64 | - | 64 |
| Learning Rate | 1e-6 | 1e-6 | 5e-7 | 5e-7 |
| Rollout Responses Per Query | 8 | 8 | - | 8 |
| Sampling Temperature | 1.0 | 1.0 | - | 1.0 |
| DPO Beta ($\beta$) | - | - | 0.1 | - |
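
For reference, the stage schedule in Table 1 can be written down as a plain configuration. The sketch below is illustrative only; the key names are hypothetical and do not come from the released training scripts.

# Illustrative encoding of Table 1; key names are hypothetical.
TRAINING_STAGES = [
    {"stage": "Reasoning Activation", "algorithm": "GRPO",
     "data": {"Long-Document": 1_000, "Math-Text": 30_000, "Reasoning-Image": 39_000},
     "global_batch_size": 128, "rollout_batch_size": 64, "learning_rate": 1e-6,
     "rollouts_per_query": 8, "sampling_temperature": 1.0},
    {"stage": "Group-Slow-Search", "algorithm": "GRPO",
     "data": {"Science-Image": 4_000, "Spatial-Image": 9_000,
              "General-Image": 10_000, "VQA-Video": 24_000},
     "global_batch_size": 64, "rollout_batch_size": 64, "learning_rate": 1e-6,
     "rollouts_per_query": 8, "sampling_temperature": 1.0},
    {"stage": "Pair-Fast-Align", "algorithm": "DPO",
     "data": ["Rollouts of VQA-Video"],
     "global_batch_size": 32, "learning_rate": 5e-7, "dpo_beta": 0.1},
    {"stage": "Group-Slow-Search", "algorithm": "GRPO",
     "data": ["VQA-Video"],
     "global_batch_size": 64, "rollout_batch_size": 64, "learning_rate": 5e-7,
     "rollouts_per_query": 8, "sampling_temperature": 1.0},
]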

Then, the whole GRPO-Verifier-DPO pipeline continuously enhances the model's long-term reasoning capability and gradually stabilizes its performance on video reasoning, iteratively pushing towards the model's inherent reasoning limit. During this iterative process, we gradually discard 80% of the simple examples ($r_a^{avg}=1$, i.e., all rollouts already answer correctly) from the previous GRPO stage to reduce training time. The entire training process equips LMMs with robust long-chain reasoning ability through slow-search GRPO (wide exploration) and fast-align DPO (targeted optimization).
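
A minimal sketch of this example-pruning step follows, assuming each training example is paired with the average accuracy reward of its rollouts from the previous GRPO stage; function and variable names are illustrative, not from the released code.

import random

def prune_solved_examples(examples, avg_rewards, drop_ratio=0.8, seed=0):
    """Discard most examples the policy already solves on every rollout.

    Examples with an average accuracy reward of 1.0 produce identical rewards
    across the group, so GRPO's group-relative advantage is zero for them;
    dropping 80% of them mainly saves training time.
    """
    rng = random.Random(seed)
    solved = [ex for ex, r in zip(examples, avg_rewards) if r >= 1.0]
    unsolved = [ex for ex, r in zip(examples, avg_rewards) if r < 1.0]
    keep_n = int(len(solved) * (1 - drop_ratio))
    kept_solved = rng.sample(solved, k=keep_n)
    return unsolved + kept_solved

# Toy usage
examples = [f"q{i}" for i in range(10)]
avg_rewards = [1.0, 0.5, 1.0, 1.0, 0.0, 1.0, 0.75, 1.0, 1.0, 1.0]
print(prune_solved_examples(examples, avg_rewards))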

3. Getting Started

VerIPO is built on Qwen2.5-VL. Before running inference or evaluation, create the environment with the following commands.

git clone https://github.com/HITsz-TMG/VerIPO
conda create -n veripo python=3.10
conda activate veripo

pip install -r requirements.txt
pip install "qwen-vl-utils[decord]"
pip install flash_attn --no-build-isolation

Inference

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the released VerIPO-7B-v1.0 checkpoint (bfloat16 + FlashAttention-2)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Uni-MoE/VerIPO-7B-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Uni-MoE/VerIPO-7B-v1.0")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 128*28*28,
                "max_frames": 128,
                "fps": 2.0
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, temperature=1e-6, repetition_penalty=1.05)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Evaluation

VerIPO is evaluated on VSI-Bench, VideoMMMU, MMVU, TOMATO, LVBench, and Video-MME.

We provide the evaluation JSON files under evaluation/json/; you need to download the videos from each benchmark's official website.

# Example: Evaluation on VSI-Bench
python evaluation/evaluation_vllm.py --model_path Uni-MoE/VerIPO-7B-v1.0 --output_path your_save_json_path --prompt_path evaluation/json/vsi_bench.json --video_dir path_to_video

# Example: Evaluation on TOMATO (this benchmark specifically uses fps=4.0)
python evaluation/evaluation_vllm.py --model_path Uni-MoE/VerIPO-7B-v1.0 --output_path your_save_json_path --prompt_path evaluation/json/tomato.json --video_dir path_to_video --video_fps 4.0

We also provide the evaluation script for calculating scores.

python evaluation/calculate_score.py --pred_path your_save_json_path

4. Acknowledgements

We acknowledge the outstanding open-source contributions from OpenRLHF, MM-EUREKA and vLLM. We also extend our gratitude to DeepSeek-R1 and QwenVL for their open-source techniques and base models, which have enabled us to further our exploration.

5. Citations

@article{li2025veripo,
  title={VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization},
  author={Li, Yunxin and Chen, Xinyu and Li, Zitao and Liu, Zhenyu and Wang, Longyue and Luo, Wenhan and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2505.19000},
  year={2025}
}
