Awesome RLVR — Reinforcement Learning with Verifiable Rewards


A curated collection of surveys, tutorials, codebases and papers on
Reinforcement Learning with Verifiable Rewards (RLVR),
a rapidly emerging paradigm that aligns both LLMs and other agents through
objective, externally verifiable signals.


An overview of how Reinforcement Learning with Verifiable Rewards (RLVR) works. (Figure taken from “Tülu 3: Pushing Frontiers in Open Language Model Post-Training”)

Why RLVR?

RLVR couples reinforcement learning with objective, externally verifiable signals, yielding a training paradigm that is simultaneously powerful and trustworthy:

  • Ground-truth rewards – unit tests, formal proofs, or fact-checkers provide binary, tamper-proof feedback (see the verifier sketch after this list).
  • Intrinsic safety & auditability – every reward can be traced back to a transparent verifier run, simplifying debugging and compliance.
  • Strong generalization – models trained on verifiable objectives tend to extrapolate to unseen tasks with minimal extra data.
  • Emergent “aha-moments” – sparse, high-precision rewards encourage systematic exploration that often yields sudden surges in capability when the correct strategy is discovered.
  • Self-bootstrapping improvement – the agent can iteratively refine or even generate new verifiers, compounding its own learning signal.
  • Domain-agnostic applicability – the same recipe works for code generation, theorem proving, robotics, games, and more.
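
To make the ground-truth-reward idea concrete, here is a minimal sketch of a unit-test verifier that maps a candidate completion to a binary reward. The toy task, test cases, and function names are illustrative assumptions rather than code from any specific RLVR codebase, and real systems sandbox the execution of model-generated code.

```python
# Minimal sketch of a unit-test verifier for code-generation RLVR.
# The toy task, test cases, and function names are illustrative only;
# real systems sandbox execution of model-generated code.

def verify_completion(completion: str,
                      tests: list[tuple[tuple, object]],
                      entry_point: str = "solution") -> float:
    """Run candidate code against hidden unit tests; return 1.0 only if all pass."""
    namespace: dict = {}
    try:
        exec(completion, namespace)           # execute the model's code (sandbox in practice!)
        fn = namespace[entry_point]           # look up the required function
        for args, expected in tests:
            if fn(*args) != expected:
                return 0.0                    # any failing test -> zero reward
        return 1.0                            # all tests pass -> full reward
    except Exception:
        return 0.0                            # crash or missing function -> zero reward

# Example: a toy "add two integers" task with its hidden tests.
candidate = "def solution(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((-4, 4), 0), ((10, 5), 15)]
print(verify_completion(candidate, tests))    # -> 1.0
```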

How does it work?

  1. Sampling. We draw one or more candidate completions \( \{a_i\}_{i=1}^{k} \) from a policy model \( \pi_\theta \) given a prompt \( s \).
  2. Verification. A deterministic function \( r(s, a_i) \) checks each completion for correctness.
  3. Rewarding.
    • If a completion is verifiably correct, it receives a reward \( r = \gamma \).
    • Otherwise the reward is \( r = 0 \).
  4. Policy update. Using the rewards, we update the policy parameters via RL (e.g., PPO).
  5. (Optional) Verifier refinement. The verifier itself can be trained, hardened, or expanded to cover new edge cases.

Through repeated iterations of this loop, the policy learns to maximise the externally verifiable reward while maintaining a clear audit trail for every decision it makes.
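
As a toy illustration of this loop (not the Tülu 3 recipe, and with names that are purely hypothetical), the sketch below replaces the LLM with a tabular softmax policy over a tiny answer set, uses exact match against a known answer as the deterministic verifier, and updates the policy with plain REINFORCE plus a group-mean baseline in the spirit of GRPO rather than full PPO.

```python
# Toy RLVR loop: sample -> verify -> reward -> policy update.
# Assumptions: a tabular softmax "policy" over four candidate answers stands in
# for an LLM, and the update is plain REINFORCE with a group-mean baseline.
import numpy as np

rng = np.random.default_rng(0)
answers = ["7", "8", "9", "10"]          # tiny "action space" for one prompt
logits = np.zeros(len(answers))          # policy parameters (theta)
ground_truth = "9"                       # what the verifier checks against
gamma, lr, k = 1.0, 0.5, 8               # reward value, learning rate, samples per prompt

def verify(answer: str) -> float:
    """Deterministic verifier r(s, a): exact match against the ground truth."""
    return gamma if answer.strip() == ground_truth else 0.0

for step in range(50):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = rng.choice(len(answers), size=k, p=probs)          # 1. sample k completions
    rewards = np.array([verify(answers[i]) for i in idx])    # 2-3. verify and reward
    baseline = rewards.mean()                                 # group-mean baseline
    grad = np.zeros_like(logits)
    for i, r in zip(idx, rewards):                            # 4. REINFORCE-style update
        one_hot = np.zeros_like(logits)
        one_hot[i] = 1.0
        grad += (r - baseline) * (one_hot - probs)            # grad of log pi(a_i | s)
    logits += lr * grad / k

final_probs = np.exp(logits - logits.max())
final_probs /= final_probs.sum()
print(dict(zip(answers, final_probs.round(2))))
# The verifiably correct answer "9" should end up with most of the probability mass.
```

In practice the policy is an LLM updated at the token level with PPO or GRPO, and the verifier is a unit-test runner, proof checker, or exact-match grader, as in the codebases listed below.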


Pull requests are welcome 🎉 — see Contributing for guidelines.

[2025-07-03] New! Initial public release of Awesome-RLVR 🎉

Table of Contents

format:
- [title](paper link) (presentation type)
  - main authors or main affiliations
  - Key: key problems and insights
  - ExpEnv: experiment environments

Surveys & Tutorials

Click to expand / collapse

Codebases

Click to expand / collapse
  • open-r1 – Fully open reproduction of the DeepSeek-R1 pipeline (SFT, distillation, GRPO, evaluation)
  • OpenRLHF – An easy-to-use, scalable and high-performance RLHF framework built on Ray (PPO, GRPO, REINFORCE++, vLLM, dynamic sampling, async agentic RL)
  • verl – A flexible, efficient and production-ready RL training library for large language models
  • TinyZero – Minimal reproduction of DeepSeek R1-Zero
  • AReaL – Ant Reasoning Reinforcement Learning for LLMs
  • Open-Reasoner-Zero – An open-source implementation of large-scale reasoning-oriented RL training focused on scalability, simplicity and accessibility
  • ROLL – An efficient and user-friendly scaling library for reinforcement learning with large language models
  • slime – An LLM post-training framework for RL scaling with high-performance training and flexible data generation
  • RAGEN – RAGEN (Reasoning AGENt, pronounced like "region") leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments
  • PRIME – PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards
  • rllm – An open-source framework for post-training language agents via reinforcement learning
  • NeMo-Aligner – Scalable toolkit for efficient model alignment
  • Trinity-RFT – A unified RFT framework with plug-and-play modules (for algorithms, data pipelines, and synchronization)

Papers

2025

NeurIPS 2025

Click to expand / collapse
  • ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

    • Xiyao Wang, Zhengyuan Yang, Chao Feng, Yuhang Zhou, Xiaoyu Liu, Yongyuan Liang, Ming Li, Ziyi Zang, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang
    • Key: Visual reasoning, Vision-Language Model, Visual captioning, Reward Model, Visual Hallucination, fine-grained hallucination criticism as RL objective
    • ExpEnv: ViCrit-Bench, natural-image reasoning, abstract image reasoning, visual math
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    • Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
    • Key: RLVR capability boundaries, pass@k analysis, base model vs RLVR-trained models, reasoning pattern emergence, distillation vs RL
    • ExpEnv: Math/coding/visual reasoning benchmarks, multiple model families and RL algorithms
  • Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    • Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
    • Key: Self-play reasoning, zero external data, self-proposed tasks, curriculum learning, code executor as unified feedback
    • ExpEnv: Coding and mathematical reasoning tasks, cross-model-scale experiments
  • CURE: Co-Evolving Coders and Unit Testers via Reinforcement Learning

    • Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
    • Key: Co-evolution of code generation and unit test generation, interaction-based rewards without ground-truth code, test-time scaling, agentic unit test generation
    • ExpEnv: SWE-bench Verified, unit test generation benchmarks, 4B model achieving 64.8% inference efficiency
  • Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

    • Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Jiaze Chen, Xuefeng Li, Qiying Yu, Hao Zhou, Mingxuan Wang
    • Key: Puzzle reasoning, generator-verifier design, 36 tasks across 7 categories, multi-task RLVR, 418K competition-level problems
    • ExpEnv: ENIGMATA-Eval, ARC-AGI (32.8%), ARC-AGI 2 (0.6%), AIME (2024-2025), BeyondAIME, GPQA (Diamond)
  • To Think or Not To Think: A Study of Thinking in Rule-Based Visual Reinforcement Fine-Tuning

    • Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, Kaipeng Zhang
    • Key: Thinking vs No-Thinking RFT, visual perception tasks, overthinking in MLLMs, equality accuracy reward, Adaptive-Thinking method
    • ExpEnv: Six diverse visual tasks across different model sizes and types, image classification benchmarks
  • Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

    • Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, Dong Yu
    • Key: Self-verification in RLVR, RISE framework, simultaneous training of problem-solving and self-verification, online RL for both tasks
    • ExpEnv: Mathematical reasoning benchmarks, verification compute analysis
  • Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data

    • Yunhao Tang, Sid Wang, Lovish Madaan, Remi Munos
    • Key: JEPO algorithm, Jensen's evidence lower bound, unverifiable data, chain-of-thought as latent variable, extending RLVR to semi-verifiable data
    • ExpEnv: Math (verifiable), numina and numina-proof (semi-verifiable/unverifiable), test set likelihood evaluation
  • RLVR-World: Training World Models with Reinforcement Learning

    • Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
    • Key: World models with RLVR, task-specific optimization beyond MLE, transition prediction metrics as rewards, autoregressive tokenized sequences
    • ExpEnv: Text games, web navigation, robot manipulation, language-based and video-based world models
  • SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

    • Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, Dacheng Tao
    • Key: Self-instruction generation, self-rewarding via majority-voting, bootstrapping with limited initial data, online filtering strategies
    • ExpEnv: Various reasoning benchmarks across different LLM backbones
  • Learning to Reason under Off-Policy Guidance

    • Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
    • Key: LUFFY framework, off-policy reasoning traces, Mixed-Policy GRPO, policy shaping via regularized importance sampling, learning beyond initial capabilities
    • ExpEnv: Six math benchmarks, out-of-distribution tasks, weak model training scenarios
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    • Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida Wang
    • Key: Software engineering with RL, lightweight rule-based reward, open-source software evolution data, Llama3-SWE-RL-70B achieving 41.0% on SWE-bench Verified
    • ExpEnv: SWE-bench Verified, function coding, library use, code reasoning, mathematics, general language understanding
  • rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

    • Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Cheng Li, Mao Yang
    • Key: 418K competition-level code problems, 580K long-reasoning solutions, input-output test case synthesis pipeline, mutual verification mechanism
    • ExpEnv: LiveCodeBench (57.3% for 7B), USA Computing Olympiad (16.15% avg for 7B, outperforming QWQ-32B)
  • SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

    • Junteng Liu, Yuanxiang Fan, Jiang Zhuo, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Mozhi Zhang, Pengyu Zhao, Junxian He
    • Key: Logical reasoning as foundation for general reasoning, 35 diverse logical reasoning tasks, controlled synthesis with adjustable difficulty, mixing with math/coding tasks
    • ExpEnv: BBEH (outperforming DeepSeek-R1-Distill-Qwen-32B by 6 points), state-of-the-art logical reasoning performance
  • SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

    • Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen
    • Key: Self-aware weakness identification, problem synthesis targeting model deficiencies, core concept extraction from failure cases, weakness-driven augmentation
    • ExpEnv: Eight mainstream reasoning benchmarks, 10% gain on 7B models, 7.7% gain on 32B models
  • AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

    • Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
    • Key: Math-only then code-only RL training sequence, curriculum learning with progressive response lengths, robust data curation for challenging prompts, on-policy parameter updates
    • ExpEnv: AIME 2025 (+14.6%/+17.2% for 7B/14B), LiveCodeBench (+6.8%/+5.8% for 7B/14B)
  • Agentic RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving

    • Xinji Mai, Haotian Xu, Xing W, Weinong Wang, Yingying Zhang, Wenqiang Zhang
    • Key: ZeroTIR (Tool-Integrated Reasoning), spontaneous code execution without supervised tool-use examples, RL scaling laws for tool use, predictable metrics scaling
    • ExpEnv: Math benchmarks, standard RL algorithms comparison
  • Rethinking Verification for LLM Code Generation: From Generation to Testing

    • Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
    • Key: Test-case generation (TCG) task, multi-dimensional thoroughness metrics, human-LLM collaborative method (SAGA), TCGBench benchmark
    • ExpEnv: TCGBench (90.62% detection rate), LiveCodeBench-v6 (10.78% higher Verifier Acc), test suite quality evaluation
  • ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    • Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
    • Key: Prolonged RL training, KL divergence control, reference policy resetting, diverse task suite, novel reasoning strategies inaccessible to base models
    • ExpEnv: Wide range of pass@k evaluations, base model competence vs training duration analysis
  • Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

    • Honglin Lin, Qizhi Pei, Zhuoshi Pan, Yu Li, Xin Gao, Juntao Li, Conghui He, Lijun Wu
    • Key: Caco framework, code-assisted CoT, automated validation via code execution, reverse-engineering to natural language instructions, Caco-1.3M dataset
    • ExpEnv: Mathematical reasoning benchmarks, code-anchored verification, instruction diversity analysis
  • The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

    • Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
    • Key: Negative Sample Reinforcement (NSR), training with only negative samples, Pass@k across entire spectrum, gradient analysis, upweighting NSR
    • ExpEnv: MATH, AIME 2025, AMC23, Qwen2.5-Math-7B, Qwen3-4B, Llama-3.1-8B-Instruct
  • Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

    • Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, Yinyu Ye
    • Key: Optimization problem formulation, Solver-Informed RL (SIRL), executable code and .lp file assessment, instance-level mathematical model verification, instance-enhanced self-consistency
    • ExpEnv: Diverse public optimization benchmarks, surpassing DeepSeek-V3 and OpenAI-o3
  • ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data

    • Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, Tao Luo
    • Key: Autoformalization, Lean 4, expert iteration with knowledge distillation, structural augmentation strategies, 117k undergraduate-level theorem statements
    • ExpEnv: All benchmarks (p<0.05, two-sided t-test), outperforming Herald Translator and Kimina-Autoformalizer
  • QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation

    • Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
    • Key: Verilog generation with RLVR, rule-based testbench generator, round-trip data synthesis, distill-then-RL pipeline, adaptive DAPO algorithm
    • ExpEnv: VerilogEval v2 (68.6% pass@1), RTLLM v1.1 (72.9% pass@1), surpassing 671B DeepSeek-R1 on RTLLM
  • QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation

    • Yang Zhang, Rui Zhang, Jiaming Guo, Huang Lei, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
    • Key: Signal-level optimization, verified signal-aware implementations extraction, Abstract Syntax Tree (AST) for code segments, signal-aware DPO
    • ExpEnv: VerilogEval, RTLLM, 7B matching DeepSeek v3 671B performance
  • miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

    • Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
    • Key: Formal reasoning, automated theorem proving, Lean prover, miniF2F benchmark analysis, discrepancy correction, miniF2F-v2 with verified statements
    • ExpEnv: miniF2F original vs miniF2F-v2, full theorem proving pipeline evaluation (70% on v2 vs 40% on original)
  • SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

    • Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, Ziwei Tang, Yuanshuai Li, Florensia Widjaja, Xintong Zhu, Feige Zhou, Yongfeng Huang, Yannis Papakonstantinou, Fatma Ozcan, Chenhao Ma, Reynold Cheng
    • Key: SQL issue debugging, BIRD-CRITIC benchmark (530 PostgreSQL + 570 multi-dialect tasks), SQL-Rewind strategy, f-Plan Boosting, BIRD-Fixer agent
    • ExpEnv: BIRD-CRITIC-PG (38.11% for 14B), BIRD-CRITIC-Multi (29.65% for 14B), surpassing Claude-3.7-Sonnet and GPT-4.1

ICML 2025

Click to expand / collapse

Other 2025 Papers

Click to expand / collapse

2024 & Earlier

Click to expand / collapse

Other Awesome Lists

Contributing

  1. Fork this repo.
  2. Add a paper/tool entry under the correct section (keep reverse-chronological order, follow the three-line format).
  3. Open a Pull Request and briefly describe your changes.

License

Awesome-RLVR © 2025 OpenDILab & Contributors, released under the Apache 2.0 License.
