# HY-Video-PRFL: Video Generation Models Are Good Latent Reward Models

Video generation models can both create and evaluate: we enable 14B models to complete full post-training at 720P with 81 frames within 67GB of VRAM, achieving 1.5× faster training and a 56% improvement in motion quality over traditional methods.
## Contents
- 🔥🔥🔥 News!!
- 📑 Open-source Plan
- 📖 Abstract
- 🏗️ Model Architecture
- 📊 Performance
- 🎬 Case Show
- 📜 Requirements
- 🛠️ Installation
- 🧱 Download Models
- 🎓 Training
- 🚀 Inference
- 📝 Citation
- 🙏 Acknowledgements

## 🔥🔥🔥 News!!
- Dec 07, 2025: 👋 We release the training and inference code of HY-Video-PRFL.
- Nov 26, 2025: 👋 We release the paper and project page. [Paper] [Project Page]

## 📑 Open-source Plan
- HY-Video-PRFL
  - Training and inference code for PAVRM
  - Training and inference code for PRFL
## 📖 Abstract

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding.
HY-Video-PRFL introduces Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space. We demonstrate that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding.
Key advantages:
- ✅ Efficient latent-space optimization
- ✅ Significant memory savings
- ✅ 1.4X faster training compared to RGB ReFL
- ✅ Better alignment with human preferences
## 🏗️ Model Architecture

Traditional RGB ReFL relies on vision-language models designed for pixel-space inputs, requiring expensive VAE decoding and confining optimization to late-stage denoising steps.
Our PRFL approach leverages pre-trained video generation models as reward models in the noisy latent space (see the sketch after this list). This enables:
- Full-chain gradient backpropagation without VAE decoding
- Early-stage supervision for motion dynamics and structure coherence
- Substantial reductions in memory consumption and training time
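To make the idea concrete, here is a minimal PyTorch sketch of a PRFL-style training step. It is illustrative only: `TinyDenoiser`, `LatentRewardModel`, the toy schedule, and the update rule are hypothetical stand-ins, not the repo's actual modules or API.

```python
# Minimal sketch of a PRFL-style training step (illustrative only).
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the DiT video denoiser operating on latents."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, z, t):
        return self.net(z)  # predicted update for latent z at step t

class LatentRewardModel(nn.Module):
    """Stand-in for PAVRM: scores noisy latents directly, no VAE decode."""
    def __init__(self, dim=16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, z):
        return self.head(z).mean()

denoiser = TinyDenoiser()
reward_model = LatentRewardModel()
for p in reward_model.parameters():
    p.requires_grad_(False)  # reward model stays frozen during PRFL

opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-5)
z = torch.randn(2, 16)               # initial noise latents
steps = torch.linspace(1.0, 0.0, 5)  # toy denoising schedule

# Full-chain backprop: every denoising step stays in the autograd graph,
# which is affordable only because latents are never decoded by the VAE.
for t in steps:
    z = z - 0.2 * denoiser(z, t)

loss = -reward_model(z)  # gradient ascent on the latent reward
opt.zero_grad()
loss.backward()
opt.step()
```

Because the reward is computed on latents, the graph above spans all denoising steps, which is what enables the early-stage supervision of motion and structure described in the list.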
## 📊 Performance

Our experiments demonstrate that PRFL achieves substantial motion quality improvements (+56.00 in dynamic degree, +21.52 in human anatomy, and superior alignment with human preferences) as well as significant efficiency gains (at least 1.4× faster training and notable memory savings).
## 🎬 Case Show

| 480P Resolution | 720P Resolution |
|---|---|
| 85216a8012f4e8bf85fb219bc8453774_seed_579180.mp4<br>**Prompt:** Two shirtless men with short dark hair are sparring in a dimly lit room. They are both wearing boxing gloves, one red and one black. One man is wearing white shorts while the other is wearing black shorts. There are several screens on the wall displaying images of buildings and people. | e2e6dff4025869caa55f4baaa8cd5208_seed_634237.mp4<br>**Prompt:** A woman with fair skin, dark hair tied back, and wearing a light green t-shirt is visible against a gray background. She uses both hands to apply a white substance from below her eyes upward onto her face. Her mouth is slightly open as she spreads the cream. |
| 490d22e34d0fdf1aba9e1ba9e20d667d_seed_218637.mp4<br>**Prompt:** The woman has dark eyes and is holding a black smartphone to her ear with her right hand. She is typing on the keyboard of an open silver laptop computer with her left hand. Her fingers have blue nail polish. She is sitting in front of a window covered by sheer white curtains. | 76dfec95b2b525b10014e6e43612f1d0_seed_588308.mp4<br>**Prompt:** A light-skinned man with short hair wearing a yellow baseball cap, plaid shirt, and blue overalls stands in a field of sunflowers. He holds a cut sunflower head in his left hand and touches it with his right index finger. Several other sunflowers are visible in the background, some facing away from the camera. |
## 📜 Requirements

We recommend using GPUs with at least 80GB of memory for better generation quality.
- OS: Linux
- CUDA: 12.4
## 🛠️ Installation

```bash
git clone https://github.com/Tencent-Hunyuan/HY-Video-PRFL.git
cd HY-Video-PRFL
```

We recommend CUDA 12.4 for installation. Conda's installation instructions are available here.
```bash
# Create conda environment
conda create -n HY-Video-PRFL python=3.10

# Activate environment
conda activate HY-Video-PRFL

# Install PyTorch and dependencies (CUDA 12.4)
pip3 install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124

# Install additional dependencies
pip3 install git+https://github.com/huggingface/transformers "qwen-vl-utils[decord]"
pip3 install git+https://github.com/huggingface/diffusers
pip3 install xfuser -i https://pypi.org/simple
pip3 install flash-attn==2.5.0 --no-build-isolation
pip3 install -e .
pip3 install nvidia-cublas-cu12==12.4.5.8

# Make the repository importable
export PYTHONPATH=./
```

## 🧱 Download Models

Download the pretrained models before training or inference:
| Model | Resolution | Download Links | Notes |
|---|---|---|---|
| Wan2.1-T2V-14B | 480P & 720P | 🤗 Huggingface 🤖 ModelScope | Text-to-Video model |
| Wan2.1-I2V-14B-720P | 720P | 🤗 Huggingface 🤖 ModelScope | Image-to-Video (High-res) |
| Wan2.1-I2V-14B-480P | 480P | 🤗 Huggingface 🤖 ModelScope | Image-to-Video (Standard) |
First, make sure you have installed the huggingface CLI or modelscope CLI.

```bash
pip install -U "huggingface_hub[cli]"
pip install modelscope
```
Then, download the pretrained DiT and VAE checkpoints. For example, you can use the following command to download the Wan2.1 checkpoint for the 720P I2V task to ./weights by default.

```bash
hf download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./weights/Wan2.1-I2V-14B-720P
```
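If you prefer the ModelScope CLI mentioned above, the equivalent download should look like the following; please verify the flags against your installed CLI version.

```bash
# Download the same checkpoint via ModelScope instead of Hugging Face
modelscope download --model Wan-AI/Wan2.1-I2V-14B-720P --local_dir ./weights/Wan2.1-I2V-14B-720P
```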
## 🎓 Training

### Data Preprocessing

```bash
python3 scripts/preprocess/gen_wanx_latent.py --config configs/pre_480.yaml
```

We provide several videos in temp_data/videos as template training data and an input json file temp_data/temp_input_data.json as a template for preprocessing. configs/pre_480.yaml is for 480P latent extraction and configs/pre_720.yaml is for 720P. The json_path and save_dir in the config file can be customized with your own training data.

The annotations for the reward model (e.g. "physics_quality": 1, "human_quality": 1) should be added to the data meta files (e.g. temp_data/480/meta_v1/0004e625d5bcb80130e1ea3d204e2488_meta_v1.json), as in the example below. This yields the meta file lists temp_data/temp_data_480.list and temp_data/temp_data_720.list, which are used in PAVRM and PRFL training.
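For illustration, a minimal meta file entry might look like the following; only the two annotation keys come from the text above, while the other field names and values are hypothetical placeholders.

```json
{
  "video_path": "temp_data/videos/0004e625d5bcb80130e1ea3d204e2488.mp4",
  "prompt": "A woman applies cream to her face against a gray background.",
  "physics_quality": 1,
  "human_quality": 1
}
```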
### PAVRM Training

For example, to train PAVRM with 8 GPUs, you can use the following command.

```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/pavrm/train_pavrm.py --config configs/train_pavrm_i2v_720.yaml
```

The meta_file_list and val_meta_file_list in the config file can be customized with your own training and validation data. We provide several config files for the different settings: t2v or i2v, 480P or 720P. Note that we train PAVRM with CE loss by default; to train PAVRM with BT loss instead, use the config file configs/train_pavrm_bt_i2v_720.yaml. The two objectives are sketched below.
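For reference, here is a minimal sketch of the two objectives as commonly defined for reward modeling. This is our reading of "CE loss" and "BT (Bradley-Terry) loss", not the repo's exact implementation; the function names are hypothetical.

```python
# Sketch of the two reward-model objectives (illustrative, not the repo's code).
import torch
import torch.nn.functional as F

def ce_loss(logits, labels):
    """CE: treat quality annotations (e.g. physics_quality in {0, 1})
    as per-video classification targets."""
    return F.cross_entropy(logits, labels)

def bt_loss(reward_chosen, reward_rejected):
    """BT (Bradley-Terry): pairwise preference loss that pushes the
    preferred video's reward above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage
logits = torch.randn(4, 2)           # per-video class logits
labels = torch.tensor([1, 0, 1, 1])  # quality annotations
print(ce_loss(logits, labels))

r_pos, r_neg = torch.randn(4), torch.randn(4)
print(bt_loss(r_pos, r_neg))
```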
### PRFL Training

```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/prfl/train_prfl.py --config configs/train_prfl_i2v_720.yaml
```

The meta_file_list in the config file can be customized with your own training data; lrm_transformer_path, lrm_mlp_path, and lrm_query_attention_path in the config file point to the reward model obtained from the previous step. We provide several config files for the different settings: t2v or i2v, 480P or 720P. A hypothetical config excerpt is shown below.
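For orientation, here is a hypothetical excerpt of such a config. Only the key names are taken from the text above; every path value is an illustrative placeholder, so substitute your own.

```yaml
# Hypothetical excerpt of configs/train_prfl_i2v_720.yaml (paths are placeholders)
meta_file_list: temp_data/temp_data_720.list
lrm_transformer_path: outputs/pavrm_i2v_720/transformer
lrm_mlp_path: outputs/pavrm_i2v_720/mlp
lrm_query_attention_path: outputs/pavrm_i2v_720/query_attention
```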
## 🚀 Inference

### PAVRM Inference

```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/pavrm/inference_pavrm.py --config configs/infer_pavrm_i2v_720.yaml
```

The val_meta_file_list in the config file can be customized with your own inference data; resume_transformer_path, resume_mlp_path, and resume_query_attention_path in the config file point to the reward model to be tested.

### PRFL Inference

PRFL inference is exactly the same as for its base model (e.g. Wan2.1).
```bash
# Standard Chinese negative prompt for Wan2.1; it lists artifacts to avoid
# (vivid tones, overexposure, static frames, blurry detail, subtitles, gray
# cast, worst/low quality, JPEG compression artifacts, ugly, incomplete,
# extra fingers, poorly drawn hands and faces, deformed or fused limbs,
# three legs, cluttered or crowded backgrounds, walking backwards).
export negative_prompt="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/prfl/inference_prfl.py \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 1 \
    --task "i2v-14B" \
    --ckpt_dir "weights/Wan2.1-I2V-14B-720P" \
    --lora_path "" \
    --lora_alpha 0 \
    --dataset_path "temp_data/temp_prfl_infer_data.json" \
    --negative_prompt "$negative_prompt" \
    --size "1280*720" \
    --frame_num 81 \
    --sample_steps 40 \
    --sample_guide_scale 5.0 \
    --sample_shift 5.0 \
    --teacache_thresh 0 \
    --save_folder outputs/infer/prfl_i2v_720 \
    --transformer_path <YOUR_CKPT_PATH> \
    --offload_model False
```

Parameters:
- `--dit_fsdp` / `--t5_fsdp`: Enable FSDP for memory efficiency
- `--task`: "t2v-14B" or "i2v-14B"
- `--ckpt_dir`: Path to the pretrained checkpoint directory
- `--lora_path` / `--lora_alpha`: Path and load weight ratio for a LoRA checkpoint file
- `--dataset_path`: Path to the inference dataset file
- `--size`: Output resolution ("1280*720" or "832*480")
- `--frame_num`: Number of frames to generate (default: 81)
- `--sample_steps`: Number of inference steps (default: 40)
- `--sample_guide_scale`: Classifier-free guidance scale (default: 5.0)
- `--sample_shift`: Flow shift (default: 5.0)
- `--teacache_thresh`: TeaCache threshold (0 disables TeaCache)
- `--save_folder`: Path to save generated videos
- `--transformer_path`: Path to your PRFL checkpoint file
- `--offload_model`: Offload the model to CPU to save GPU memory
## 📝 Citation

If you find HY-Video-PRFL useful for your research, please cite:
```bibtex
@article{mi2025video,
  title={Video Generation Models are Good Latent Reward Models},
  author={Mi, Xiaoyue and Yu, Wenqing and Lian, Jiesong and Jie, Shibo and Zhong, Ruizhe and Liu, Zijun and Zhang, Guozhen and Zhou, Zixiang and Xu, Zhiyong and Zhou, Yuan and Lu, Qinglin and Tang, Fan},
  journal={arXiv preprint arXiv:2511.21541},
  year={2025}
}
```

## 🙏 Acknowledgements

We sincerely thank the contributors to the following projects:
Star ⭐ this repo if you find it helpful!