Harbin Institute of Technology, Shenzhen
If you find this work useful for your research, please cite our paper:
@article{li2025perception,
title={Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models},
author={Li, Yunxin and Liu, Zhenyu and Li, Zitao and Zhang, Xuanyu and Xu, Zhenran and Chen, Xinyu and Shi, Haoyuan and Jiang, Shenyuan and Wang, Xintong and Wang, Jifang and Huang, Shouzheng and Zhao, Xinping and Jiang, Borui and Hong, Lanqing and Wang, Longyue and Tian, Zhuotao and Huai, Baoxing and Luo, Wenhan and Luo, Weihua and Zhang, Zheng and Hu, Baotian and Zhang, Min},
journal={arXiv preprint arXiv:2505.04921},
year={2025}
}
🔥 Latest 👉 [2025/08/02] We updated the recommended works about multimodal reasoning in our repo and paper. You are welcome to recommend your work to us.
🔥 Latest 👉 [2025/07/06] We collected recent multimodal reasoning models and benchmarks (about 150 papers in 2025.05~2025.06) in our new version and repository. You are welcome to recommend your work to us.
🔥 Latest Updates (Click to See More News)
[2025/05/20] 🏮 We have added previously uncovered works (from issues) to the subsections and paper below, and will continue to include the newest works. You are welcome to recommend your work to us.
[2025/05/11] 🏮 Exciting news! Our survey was quickly highlighted as the first paper for May 2025 on Hugging Face Daily Papers. Check it out: https://huggingface.co/papers/2505.04921.
[2025/05/09] 🏮 We've analyzed 550+ papers charting the rise of Large Multimodal Reasoning Models (LMRMs). Discover the four-stage journey from basic modules to advanced MCoT/RL, and our vision of Native LMRMs (e.g., capability scope and levels, technical prospects) for comprehensive perception, precise understanding, deep reasoning, and planning.
✨ Advances on multimodal reasoning models and a collection of related datasets and benchmarks ✨
Figure 1: The core evolving path of large multimodal reasoning models.
Figure 2: The roadmap of large multimodal reasoning models.
Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities—aiming to achieve comprehensive perception, precise understanding, and deep reasoning.
As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior.
We present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field’s shifting design philosophies and emerging capabilities.
First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion.
Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains.
Finally, drawing on empirical insights from challenging benchmarks and experimental case studies of OpenAI o3 and o4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
Click to expand Modular Reasoning Networks table
| Model | Year | Architecture | Highlight | Training Method |
|---|---|---|---|---|
| NMN | 2016 | Modular | Dynamically assembles task-specific modules for visual-textual reasoning. | Supervised learning |
| HieCoAtt | 2016 | Attention-based | Aligns question semantics with image regions via hierarchical cross-modal attention. | Supervised learning |
| MCB | 2016 | Bilinear | Optimizes cross-modal feature interactions with efficient bilinear modules. | Supervised learning |
| SANs | 2016 | Attention-based | Iteratively refines reasoning through multiple attention hops over visual features. | Supervised learning |
| DMN | 2016 | Memory-based | Integrates memory modules for multi-episode reasoning over sequential inputs. | Supervised learning |
| ReasonNet | 2017 | Modular | Decomposes reasoning into entity-relation modules for structured inference. | Supervised learning |
| UpDn | 2018 | Attention-based | Combines bottom-up and top-down attention for object-level reasoning. | Supervised learning |
| MAC | 2018 | Memory-based | Uses a memory-augmented control unit for iterative compositional reasoning. | Supervised learning |
| BAN | 2018 | Bilinear | Captures high-order interactions via bilinear attention across modalities. | Supervised learning |
| HeteroMemory | 2019 | Memory-based | Synchronizes appearance and motion modules for video-based temporal reasoning. | Supervised learning |
| MuRel | 2019 | Relational | Models reasoning as a relational network over object pairs for fine-grained inference. | Supervised learning |
| MCAN | 2019 | Attention-based | Employs modular co-attention with self- and guided-attention for deep reasoning. | Supervised learning |
Click to expand Vision-Language Models table
| Model | Year | Architecture | Highlight | Training Method |
|---|---|---|---|---|
| ViLBERT | 2019 | Dual-Encoder | Aligns visual-text features via dual-stream Transformers with cross-modal attention. | Pretraining + fine-tuning |
| LXMERT | 2019 | Dual-Encoder | Enhances cross-modal reasoning with dual-stream pretraining on diverse tasks. | Pretraining + fine-tuning |
| X-LXMERT | 2020 | Dual-Encoder | Extends dual-stream reasoning with generative cross-modal pretraining. | Pretraining + fine-tuning |
| ALBEF | 2021 | Dual-Encoder | Integrates contrastive learning with momentum distillation for robust reasoning. | Contrastive + generative pretraining |
| SimVLM | 2021 | Dual-Encoder | Uses prefix-based pretraining for flexible cross-modal reasoning. | Pretraining + fine-tuning |
| VLMo | 2022 | Dual-Encoder | Employs a mixture-of-modality-experts for dynamic cross-modal reasoning. | Pretraining + fine-tuning |
| METER | 2022 | Dual-Encoder | Enhances reasoning with a modular encoder-decoder for robust alignment. | Pretraining + fine-tuning |
| BLIP | 2022 | Dual-Encoder | Bootstraps alignment with contrastive learning for efficient reasoning. | Contrastive + generative pretraining |
| VisualBERT | 2019 | Single-Transformer-Backbone | Fuses visual-text inputs in a single Transformer for joint contextual reasoning. | Pretraining + fine-tuning |
| VL-BERT | 2019 | Single-Transformer-Backbone | Enhances cross-modal reasoning with unified visual-language pretraining. | Pretraining + fine-tuning |
| UNITER | 2020 | Single-Transformer-Backbone | Reasons via joint contextual encoding in a single Transformer backbone. | Pretraining + fine-tuning |
| PixelBERT | 2020 | Single-Transformer-Backbone | Processes pixels with CNN+Transformer for fine-grained cross-modal reasoning. | Pretraining + fine-tuning |
| UniVL | 2020 | Single-Transformer-Backbone | Unifies video-language reasoning with a single Transformer for temporal tasks. | Pretraining + fine-tuning |
| Oscar | 2020 | Single-Transformer-Backbone | Anchors reasoning with object tags in a unified Transformer for semantic inference. | Pretraining + fine-tuning |
| VinVL | 2021 | Single-Transformer-Backbone | Boosts reasoning with enhanced visual features in a single Transformer. | Pretraining + fine-tuning |
| ERNIE-ViL | 2021 | Single-Transformer-Backbone | Integrates scene graph knowledge for structured visual-language reasoning. | Pretraining + fine-tuning |
| UniT | 2021 | Single-Transformer-Backbone | Streamlines multimodal tasks with a shared self-attention Transformer backbone. | Pretraining + fine-tuning |
| Flamingo | 2022 | Single-Transformer-Backbone | Prioritizes dynamic vision-text interactions via cross-attention. | Pretraining + fine-tuning |
| CoCa | 2022 | Single-Transformer-Backbone | Combines contrastive and generative heads for versatile cross-modal reasoning. | Contrastive + generative pretraining |
| BEiT-3 | 2022 | Single-Transformer-Backbone | Unifies vision-language learning with masked data modeling. | Pretraining + fine-tuning |
| OFA | 2022 | Single-Transformer-Backbone | Provides a unified multimodal framework for efficient cross-modal reasoning. | Pretraining + fine-tuning |
| PaLI | 2022 | Single-Transformer-Backbone | Scales reasoning with a multilingual single-Transformer framework. | Pretraining + fine-tuning |
| BLIP-2 | 2023 | Single-Transformer-Backbone | Uses a querying Transformer for improved cross-modal reasoning efficiency. | Pretraining + fine-tuning |
| Kosmos-1 | 2023 | Single-Transformer-Backbone | Enables interleaved input processing for flexible multimodal understanding. | Pretraining + fine-tuning |
| Kosmos-2 | 2023 | Single-Transformer-Backbone | Enhances grounding capability for precise object localization and reasoning. | Pretraining + fine-tuning |
| CLIPCap | 2021 | Vision-Encoder-LLM | Projects CLIP visual features into an LLM for reasoning and captioning. | Fine-tuning |
| LLaVA | 2023 | Vision-Encoder-LLM | Tunes ViT-LLM integration for conversational multimodal reasoning. | Instruction tuning |
| MiniGPT-4 | 2023 | Vision-Encoder-LLM | Aligns ViT to a frozen LLM via projection for streamlined reasoning. | Fine-tuning |
| InstructBLIP | 2023 | Vision-Encoder-LLM | Uses instruction tuning to align ViT with LLM for multimodal reasoning. | Instruction tuning |
| Qwen-VL | 2023 | Vision-Encoder-LLM | Incorporates spatial-aware ViT for enhanced grounded reasoning. | Pretraining + fine-tuning |
| mPLUG-Owl | 2023 | Vision-Encoder-LLM | Integrates modular visual encoder with LLM for instruction-following reasoning. | Instruction tuning |
| Otter | 2023 | Vision-Encoder-LLM | Combines modular visual encoder with LLM for in-context multimodal reasoning. | Instruction tuning |
With the advent of large-scale multimodal pretraining, MLLMs have started to demonstrate emergent reasoning capabilities. However, such inferences are often shallow, relying primarily on implicit correlations rather than explicit logical processes. To mitigate this limitation, MCoT has emerged as a simple yet effective approach. By incorporating intermediate reasoning steps, MCoT improves cross-modal alignment, knowledge integration, and contextual grounding, all without the need for extensive supervision or significant architectural modifications. In this stage, we categorize existing approaches into three paradigms: prompt-based MCoT, structural reasoning with predefined patterns, and tool-augmented reasoning leveraging lightweight external modules.
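As a concrete illustration of the first two paradigms, the sketch below layers a prompt-based MCoT call and a staged caption-then-ground-then-reason pipeline on top of a generic chat-capable multimodal model. The `query_vlm` helper and all prompt wording are hypothetical placeholders rather than the interface of any specific method listed below; tool-augmented reasoning follows the same pattern, with the intermediate stages replaced by calls to external modules (detectors, OCR, retrievers).

```python
from typing import Optional

def query_vlm(prompt: str, image_path: Optional[str] = None) -> str:
    """Hypothetical helper: send a prompt (plus an optional image) to any
    chat-capable multimodal model and return its text reply."""
    raise NotImplementedError("Plug in your own multimodal model API here.")

def prompt_based_mcot(image_path: str, question: str) -> str:
    """Prompt-based MCoT: a single call that elicits intermediate reasoning steps."""
    prompt = (
        f"Question: {question}\n"
        "Think step by step about what the image shows, "
        "then give the final answer on the last line."
    )
    return query_vlm(prompt, image_path)

def structural_mcot(image_path: str, question: str) -> str:
    """Structural MCoT with predefined stages: caption -> localize -> reason."""
    caption = query_vlm("Briefly describe the overall content of this image.", image_path)
    regions = query_vlm(
        f"For the question '{question}', which objects or regions matter most?",
        image_path,
    )
    return query_vlm(
        "Using the context below, reason step by step and answer the question.\n"
        f"Caption: {caption}\nKey regions: {regions}\nQuestion: {question}",
        image_path,
    )
```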
Figure 3: Taxonomy and representative methods of structural reasoning in multimodal chain-of-thought.
Click to expand Structural Reasoning table
| Name | Modality | Task | Reasoning Structure | Datasets | Highlight |
|---|---|---|---|---|---|
| Cantor | T,I | VQA | Perception, Decision | - | Decouples perception and reasoning via feature extraction and CoT-style integration. |
| TextCoT | T,I | VQA | Caption, Localization, Precise observation | - | First summarizes visual context, then generates CoT-based responses. |
| Grounding-Prompter | T,V,A | Temporal Sentence Grounding | Denoising | VidChapters-7M | Performs global parsing, denoising, and partitioning before reasoning. |
| Audio-CoT | T,A | AQA | Manual-CoT, Zero-Shot-CoT, Desp-CoT | - | Enhances audio reasoning by applying three chain-of-thought paradigms. |
| VIC | I,T | VQA | Thinking before looking | - | Breaks tasks into text-based sub-steps before integrating visual inputs to form final rationales. |
| Visual Sketchpad | I,T | VQA, math QA | Sketch-based reasoning paradigm | - | Organizes rationales into "Thought, Action, Observation" phases. |
| Det-CoT | I,T | VQA | Subtask decomposition, Execution, and Verification | - | Formalizes VQA reasoning as a combination of subtasks and reviews. |
| BDoG | I,T | VQA | Entity update, Relation update, Graph pruning | - | Utilizes a dedicated debate-summarization pipeline with specialized agents. |
| CoTDet | I,T | object detection | Object listing, Affordance analysis, Visual feature summarization | COCO-Tasks | Achieves object detection via human-like procedure of listing, analyzing and summarizing. |
| CoCoT | I,T | VQA | Contrastive prompting strategy | - | Systematically contrasts input similarities and differences. |
| TeSO | T,A,V | Temporal Sentence Grounding | Visual summary, Sound filtering, Denoising | Youtube-8M, Semantic-ADE20K | Robustly localizes sounding objects in the visual space through global understanding, sounding object filtering, and noise removal. |
| Emma-X | I,T | Robotic task | Grounded CoT reasoning, Look-ahead spatial reasoning | Dataset based on BridgeV2 | Integrates grounded chain-of-thought planning with look-ahead spatial reasoning. |
| DDCoT | T,I | VQA | Question Deconstruct, Rationale | ScienceQA | Maintains a critical attitude by identifying reasoning and recognition responsibilities through the combined effect of negative-space design and visual deconstruction. |
| AVQA-CoT | T,A,V | AVQA | Question Deconstruct, Question Selection, Rationale | MUSIC-AVQA | Decomposes complex questions into multiple simpler sub-questions and leverages LLMs to select relevant sub-questions for audio-visual question answering. |
| CoT-PT | T,I | Image Classification, Image-Text Retrieval, VQA | Coarse-to-Fine Image Concept Representation | ImageNet | First to successfully adapt CoT for prompt tuning by combining visual and textual embeddings in the vision domain. |
| IoT | T,I | VQA | Visual Action Selection, Execution, Rationale, Summary, Self-Refine | - | Enhances visual reasoning by integrating visual and textual rationales through a model-driven multimodal reasoning chain. |
| Shikra | T,I | VQA, PointQA | Caption, Object Grounding | ScienceQA | Expresses spatial coordinates in natural language, enabling referential dialogue and grounded reasoning over image regions. |
| E-CoT | T,I,A | Policy Generalization | Task Rephrase, Planning, Task Deconstruct, Object Grounding | BridgeData V2 | Integrates semantic planning with low-level perceptual and motor reasoning, advancing task formulations in embodied intelligence. |
| CoS | T,I | VQA | Object Grounding, Rationale | Llava665K | Guides the model to identify and focus on key image regions relevant to a question, enabling multi-granularity understanding without compromising resolution. |
| TextCoT | T,I | VQA | Caption, Object Grounding, Image Zoom | Llava665K, ShareGPT4V | Enables accurate and interpretable multimodal question answering through staged processing: overview, coarse localization, and fine-grained observation. |
| DCoT | T,I | VQA | Object Grounding, Fine-Grained Image Generation, Similar Example Retrieve, Rationale | - | Uses a dual-guidance mechanism by combining bounding box cues to focus attention on relevant image regions and retrieving the most suitable examples from a curated demonstration cluster as contextual support. |
Click to expand Externally Augmented Reasoning table
| Name | Modality | Task | Enhancement Type | External Source | Highlight |
|---|---|---|---|---|---|
| MM-ToT | T,I | Image Generation | Search Algorithm | DFS,BFS | Applies DFS and BFS to select optimal outputs. |
| HoT | T,I | VQA | Search Algorithm | multi-hop random walks on graph | Generates linked thoughts from multimodal data in a hyperedge. |
| AGoT | T,I | Text-Image Retrieval, VQA | Search Algorithm | prompt aggregation and prompt flow operations | Builds a graph to aggregate multi-faceted reasoning with visuals. |
| BDoG | T,I | VQA | Search Algorithm | Graph Condensation: Entity update, Relation update, Graph pruning | Effective three-agent debate forms thought graph for multimodal queries. |
| L3GO | T,I | 3D Object Generation & Composition | Tools | Blender, ControlNet | Iterative part-based 3D construction through LLM reasoning in a simulation environment. |
| HDRA | T,I | Knowledge-QA, Visual Grounding | Tools | RL agent controller, Visual Foundation Models | RL agent controls multi-stage visual reasoning through dynamic instruction selection. |
| Det-CoT | T,I | object detection | Tools | Visual Processing Prompts | Visual prompts guide MLLM attention for structured detection reasoning. |
| Chain-of-Image | T,I | Geometric, chess & commonsense reasoning | Tools | Chain of Images prompting | Generates intermediate images during reasoning for visual pattern recognition. |
| AnyMAL | T, I, A, V | Cross-modal reasoning, multimodal QA | Tools | Pre-trained alignment module | Efficient integration of diverse modalities; strong reasoning via LLaMA-2 backend. |
| SE-CMRN | T,I | Visual Commonsense Reasoning | Tools | Syntactic Graph Convolutional Network | Enhances language-guided visual reasoning via syntactic GCN in a dual-branch network. |
| RAGAR | T,I | Political Fact-Checking | RAG | DuckDuckGo & SerpAPI | Integrates MLLMs with retrieval-augmented reasoning to verify facts using text and image evidence. |
| Chain-of-action | T,I | Info retrieval | RAG | Google Search, ChromaDB | Decomposes questions into reasoning chains with configurable retrieval actions to resolve conflicts between knowledge sources. |
| KAM-CoT | T,I, KG | Educational science reasoning | RAG | ConceptNet knowledge graph | Enhances reasoning by retrieving structured knowledge from graphs and integrating it through two-stage training. |
| AR-MCTS | T,I | Multi-step reasoning | RAG | Contriever, CLIP dual-stream | Step-wise retrieval with Monte Carlo Tree Search for verified reasoning. |
| MR-MKG | T, I | General multimodal reasoning | RAG | RGAT | Enhances multimodal reasoning by integrating information from multimodal knowledge graphs. |
| Reverse-HP | T, I | Disease-related reasoning | RAG | reverse hyperplane projection | Utilizes KG embeddings to enhance reasoning for specific diseases with multimodal data. |
| MarT | T, I | Analogical reasoning | RAG | Structure-guided relation transfer | Uses structure mapping theory and relation-oriented transfer for analogical reasoning with KG. |
| MCoT-Memory | T,I | VQA | Multimodal Information Enhancing | LLAVA | Memory framework and scene graph construction for effective long-horizon task planning |
| MGCoT | T,I | VQA | Multimodal Embedding Enhancing | ViT-large encoder | Precise visual feature extraction aiding multimodal reasoning |
| CCoT | T,I | VQA | Multimodal Perception Enhancing | Scene Graphs | Utilization of the generated scene graph as an intermediate reasoning step. |
| CVR-LLM | T,I | VQA | Multimodal Embedding Enhancing | BLIP2-FlanT5 & BLIP2 multi-embedding | Produces precise, context-aware image descriptions through iterative self-refinement and effectively integrates textual and multimodal factors. |
| TeSO | T,V,A | Temporal Sentence Grounding (TSG) | Multimodal Information Enhancing | VGGish | Integrates text semantics to mitigate segmentation preference, improving audio-visual correlation and boosting AVS performance. |
| CAT | T,I | Image Captioning | Multimodal Perception Enhancing | SAM | Integrates pre-trained image caption generators, SAM, and instruction-tuned large language models for controllable captioning. |
While structural reasoning introduces predefined patterns to guide MLLMs toward more systematic reasoning, it remains constrained by shallow reasoning depth and limited adaptability. To handle more complex multimodal tasks, recent work aims to develop System-2-style reasoning (Kahneman, 2011). Unlike fast and reactive strategies, this form of reasoning is deliberate, compositional, and guided by explicit planning. By extending reasoning chains, grounding them in multimodal inputs, and training with supervised or reinforcement signals, these models begin to exhibit long-horizon reasoning and adaptive problem decomposition.
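Most of the R1-style models collected in the MM-R1 table below are trained with GRPO (or related recipes such as DAPO, RLOO, and DPO), typically with verifiable rewards as indicated in the "Rule-base/RM" column. As a rough, simplified sketch of that recipe (notation follows the common GRPO formulation; individual papers differ in clipping, KL handling, and reward design), each query $q$ yields a group of $G$ sampled responses whose rewards are normalized within the group to form advantages:

```latex
% Group-relative advantage for response o_i with scalar (often rule-based) reward r_i
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% Simplified sequence-level GRPO objective: clipped importance ratio plus a KL penalty
\mathcal{J}(\theta) \;=\; \mathbb{E}_{q,\,\{o_i\}\sim\pi_{\theta_{\mathrm{old}}}}\!\left[
  \frac{1}{G}\sum_{i=1}^{G}
  \min\!\left(
    \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,\hat{A}_i,\;
    \operatorname{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i
  \right)
  \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right]
```

Rule-based rewards (exact-answer matching, IoU thresholds, format checks) make $r_i$ cheap to verify, which is why most entries in the table report "Rule-base" rather than a learned reward model.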
Figure 4: Timeline (top) and core components (bottom) of recent multimodal O1-like and R1-like models.
Click to expand Cross-Modal Reasoning table
| Name | Modality | Cross-Modal Reasoning | Task | Highlight |
|---|---|---|---|---|
| IdealGPT | T, I | Answer sub-questions about the image via GPT | VQA, Text Entailment | Using GPT to iteratively decompose and solve visual reasoning tasks |
| AssistGPT | T, I, V | Plan, Execute, Inspect via external tools (GPT-4, OCR, grounding, etc.) | VQA, Causal Reasoning | Using an interleaved code-and-language reasoning approach to handle complex multimodal tasks |
| ProViQ | T, V | Generate and execute Python programs for the video | Video VQA | Using procedural programs to solve visual subtasks in videos |
| MM-REACT | T, I, V | Use CV tools for sub-tasks about the image | VQA, Video VQA | Vision experts combined with GPT for multimodal reasoning and action |
| VisualReasoner | T, I | Synthesize multi-step reasoning data (using external CV tools) | GQA, VQA | Proposing a least-to-most visual reasoning paradigm and a data synthesis approach for training |
| Multi-model-thought | T, I | External Tools(Visual Sketchpad) | Geometry, Math, VQA | Investigating inference-time scaling for multi-modal thought across diverse tasks |
| FaST | T, I | System switch adapter for visual reasoning | VQA | Integrating fast and slow thinking mechanisms into visual agents |
| ICoT | T, I | Generate interleaved visual-textual reasoning via ADS | VQA | Using visual patches as reasoning carriers to improve LMMs' fine-grained reasoning |
| Image-of-Thought | T, I | Extract visual rationales step-by-step via IoT prompting | VQA | Using visual rationales to enhance LLMs' reasoning accuracy and interpretability |
| CoTDiffusion | T, I | External Algorithms | Robotics | Generating subgoal images before action to enhance reasoning in long-horizon robot manipulation tasks |
| T-SciQ | T, I | Model-Intrinsic Capabilities | ScienceQA | Using LLM-generated reasoning signals to teach multimodal reasoning for complex science QA |
| Visual-CoT | T, I | Model-Intrinsic Capabilities | VQA, DocQA, ChartQA | Using visual-text pairs as reasoning carriers to bridge logical gaps in sequential data |
| VoCoT | T, I | Model-Intrinsic Capabilities | VQA | Using visually-grounded object-centric reasoning paths for multi-step reasoning |
| MVoT | T, I | Model-Intrinsic Capabilities | Spatial Reasoning | Using multimodal reasoning with image visualizations to enhance complex spatial reasoning in LMMs |
Click to expand MM-O1 table
| Name | Backbone | Dataset | Modality | Reasoning Paradigm | Task Type | Highlight |
|---|---|---|---|---|---|---|
| Marco-o1 | Qwen2-7B-Instruct | Open-O1 CoT + Marco-o1 CoT + Marco-o1 Instruction | T | MCTS-guided Thinking | Math, Translation | MCTS for solution expansion and reasoning action strategy |
| LLaMA-Berry | LLaMA-3.1-8B | PRM800K + OpenMathInstruct-1 | T | MCTS-guided Thinking | Math | SR-MCTS for search and PPRM for evaluation |
| RBF++ | LLaMA3-8B-Instruct | GSM8K, SVAMP, MATH | T | SR-MCTS (Structured and Recursive MCTS) + PPRM | Math | Proposes SR-MCTS for structured search and PPRM for evaluating reasoning boundaries |
| LLaVA-CoT | Llama-3.2V-11B-cot | LLaVA-CoT-100k | T, I | Summary, Caption, Thinking | Science, General | Introduces LLaVA-CoT-100k and scalable beam search |
| LlamaV-o1 | Llama-3.2V-11B-cot | LLaVA-CoT-100k + PixMo | T, I | Summary, Caption, Thinking | Science, General | Introduces VRC-Bench and outperforms comparable open-source models on step-by-step visual reasoning |
| Mulberry | Llama-3.2V-11B-cot, LLaVA-Next-8B, Qwen2-VL-7B | Mulberry-260K | T, I | Caption, Rationales, Thinking | Math, General | Introduces Mulberry-260K and CoMCTS for collective learning |
| RedStar-Geo | InternVL2-8B | GeoQA | T, I | Long-Thinking | Math | Competitive with minimal Long-CoT data |
Click to expand MM-R1 table
| Approach | Backbone | Dataset | RL Algorithm | Modality | Task Type | RL Framework | Cold Start | Rule-base/RM |
|---|---|---|---|---|---|---|---|---|
| RLHF-V | LLaVA-13B | RLHF-V-Dataset(1.4k) | DPO | T, I | VQA | Muffin | - | (unknown) |
| InternVL2.5 | InternVL | MMPR(3m) | MPO(DPO) | T, I | VQA | - | - | (unknown) |
| Insight-V | LLaMA3-LLaVA-Next | - | DPO | T, I | VQA | trl | - | (unknown) |
| LLaVA-Reasoner-DPO | LLaMA3-LLaVA-Next | ShareGPT4o-reasoning-dpo(6.6k) | DPO | T, I | VQA | trl | - | (unknown) |
| VLM-R1 | Qwen2.5-VL | coco , LISA , Refcoco | GRPO | T, I | Grounding ,Math , Open-Vocabulary Detection | trl | No | Rule-base |
| R1-V | Qwen2-VL | CLEVR , GEOQA | GRPO | T, I | Counting , Math | trl | No | Rule-base |
| MM-EUREKA | InternVL2.5 | K12 , MMPR | RLOO | T, I | Math | OpenRLHF | Yes | Rule-base |
| MM-EUREKA-Qwen | Qwen2.5-VL | K12 , MMPR | GRPO | T, I | Math | OpenRLHF | No | Rule-base |
| Video-R1 | Qwen2.5-VL | Video-R1(260K) | GRPO | T, I, V | Video VQA | trl | Yes | Rule-base |
| LMM-R1 | Qwen2.5-VL | VerMulti | PPO | T, I | Math | OpenRLHF | No | RM |
| Vision-R1 | Qwen2.5-VL | LLaVA-CoT , Mulberry | GRPO | T, I | Math | - | Yes | Rule-base |
| Visual-RFT | Qwen2-VL | coco , LISA , ... | GRPO | T, I | Detection , Classification | trl | No | Rule-base |
| STAR-R1 | Qwen2.5-VL-7B | TRANCE(13.5k) | GRPO | T, I | Spatial Reasoning (Transformation) | vLLM | No | Rule-base |
| VL-Rethinker | Qwen2.5-VL | MathVista, MathVerse, MathVision, MMMU-Pro, EMMA, MEGA | GRPO+SSR | T, I | Mathematical, Scientific, Real-world Reasoning | trl | No | Rule-base |
| Reason-RFT | Qwen2.5-VL | CLEVR-Math, Super-CLEVR, GeoMath, Geometry3K, TRANCE | GRPO | T, I | Counting, Structure Perception, Spatial Transformation | trl | No | Rule-base |
| R1-OneVision | Qwen2.5-VL | R1-Onevision-Dataset | GRPO | T, I | Math , Science , General , Doc | - | Yes | Rule-base |
| Seg-Zero | Qwen2.5-VL , SAM2 | RefCOCOg , ReasonSeg | GRPO | T, I | Grounding | verl | No | Rule-base |
| VisualThinker-R1-Zero | Qwen2-VL | SAT dataset | GRPO | T, I | Spatial Reasoning | trl | No | Rule-base |
| R1-Omni | HumanOmni | MAFW , DFEW | GRPO | T, I, A, V | emotion recognition | trl | Yes | Rule-base |
| OThink-MR1 | Qwen2.5-VL | CLEVR , GEOQA | GRPO | T, I | Counting , Math | - | No | Rule-base |
| Multimodal-Open-R1 | Qwen2-VL | multimodal-open-r1-8k-verified(based on Math360K and Geo170K) | GRPO | T,I | Math | trl | No | Rule-base |
| Curr-ReFT | Qwen2.5-VL | RefCOCOg , Math360K , Geo170K | GRPO | T,I | Detection , Classification , Math | Curr-RL | No | RM |
| Open-R1-Video | Qwen2-VL | open-r1-video-4k | GRPO | T, I, V | Video VQA | trl | No | Rule-base |
| VisRL | Qwen2.5-VL | VisCoT | DPO | T,I | VQA | trl | Yes | RM |
| R1-VL | Qwen2-VL | Mulberry-260k | StepGRPO | T,I | Math , ChartQA | not release | No | Rule-base |
| WEBAGENT-R1 | Qwen2.5-3B/Llama3.1-8B | WebArena-Lite | M-GRPO | T | web tasks | no release | Yes | RM |
| WavReward | Qwen2.5-Omni-7B-Think | ChatReward-30K | PPO | T,A | end-to-end dialogue | not release | No | Rule-base |
| VPRL | LVM-3B | FrozenLake, Maze, MiniBehavior | GRPO | I | Visual Spatial Planning | no release | Yes | Rule-base |
| VideoChat-R1 | Qwen2.5-VL-Instruct | Charade - STA + NExTGQA + FIBER-1k + VidTAB | GRPO | T, I, V | Video Grounding + Video VQA | trl | No | Rule-base |
| VerIPO | Qwen2.5-VL-Instruct | DAPO-Math + ViRL39K + VQA-Video-24K | GRPO + DPO | T, I, V | Video VQA + Spatial | OpenRLHF | No | Rule-base |
| VAU-R1 | Qwen2.5-VL-Instruct | VAU-Bench-Train | GRPO | T, I, V | Anomaly Understanding+ Video VQA + Video Grounding | trl | No | Rule-base |
| UnifiedReward-Think | UnifiedReward | HPD(25.6K),EvalMuse(3K),OpenAI-4o_t2i_human_preference (6.7K),VideoDPO (10K),Text2Video-Human Preferences (5.7K),ShareGPTVideo-DPO (17K) | GRPO | T,I,V | Video/Image Understanding,Reward Assessment | trl | yes | Rule-base |
| UIShift | Qwen2.5‑VL‑3B‑Instruct,Qwen2.5‑VL‑7B‑Instruct | no release | GRPO | T,I | GUI automation,GUI grounding | VLM-R1 | no | Rule-base |
| UI-R1 | Qwen2.5-VL-3B | ScreenSpot(mobile subset),AndroidControl(1K) | GRPO | T,I | GUI Action Prediction,GUI grounding | no release | no | Rule-base |
| TW-GRPO | Qwen2.5-VL-Instruct | CLEVRER dataset | GRPO | T, I, V | Video VQA | trl | No | Rule-base |
| TinyLLaVA-Video-R1 | Qwen2.5-VL-Instruct | NextQA | GRPO | T, I, V | Video VQA | trl | Yes | Rule-base |
| Time-R1 | Qwen2.5-VL-Instruct | YT-Temporal + DiDeMo + QuerYD + InternVid + HowTo100M + VTG-IT + TimeIT + TimePro + HTStep + LongVid | GRPO | T, I, V | Video Grounding | trl | Yes | Rule-base |
| Spatial-MLLM | Qwen2.5-VL-Instruct | Spatial-MLLM-120k | GRPO | T, I, V | Spatial | not release | Yes | Rule-base |
| SpaceR | Qwen2.5-VL-Instruct | SpaceR-151k | GRPO | T, I, V | Spatial + VideoVQA | trl | No | Rule-base |
| SoundMind | Qwen2.5-Omni-7B | Audio Logical Reasoning(ALR) | REINFORCE++ | T,A | Audio text bimodal reasoning | VeRL | No | Rule-base |
| Skywork-VL Reward | Qwen2.5-VL-7B-Instruct | LLaVA-Critic-113k,Skywork-Reward-Preference-80Kv0.2,RLAIF-V-Dataset | MPO | T,I | VQA,Math,Science,Reasoning | not release | no | Rule-base |
| ShapeLLM-Omni | Qwen-2.5-VL-Instruct-7B | 3D-Alpaca | Not explicitly mentioned (uses autoregressive modeling) | T,I,3D | 3D Generation, 3D Understanding, 3D Editing | Not directly stated (uses supervised fine-tuning and autoregressive training) | No | Rule-base |
| GRPO-CARE | Qwen2.5-VL-Instruct | SEED-Bench-R1-Train | GRPO | T, I, V | Video VQA + Spatial | trl | No | Rule-base |
| SARI | Qwen2-Audio-7B-Instruct/ Qwen2.5-Omni | AudioSet+MusicBench+Meld+AVQA | GRPO | T,A | Audio QA | trl | No | Rule-base |
| Router-R1 | Qwen2.5-3B-Instruct , LLaMA-3.2-3B-Instruct | Natural Questions, TriviaQA, PopQA; HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle | PPO | T | Multi-hop Question Answering | verl | Yes | RM + Rule-base |
| RM-R1 | Qwen-Instruct (7B/14B/32B), DeepSeek-Distilled-Qwen (7B/14B/32B) | Skywork-Reward-Preference, Code-Preference-Pairs, Math-DPO-10K | GRPO | T | Reward Modeling | verl | Yes | RM |
| LoVeC | Llama-3-8B-Instruct and Gemma-2-9B-It | WildHallu,Bios,PopQA | GRPO,DPO,and ORPO | T | long-form generation | TRL/vLLM | No | Rule-base+RM |
| ReFoCUS | LLaVA-OV / InternVL | ReFoCUS-962K | GRPO | T, I, V | Video VQA | not release | No | RM |
| ReCode | Qwen-2.5-Coder-7B-Instruct and DeepSeekv1.5-Coder-7B-Instruct | construct own training dataset | GRPO,DAPO | T | code generation | not release | No | Rule-base |
| R1-Zero-VSI | Qwen2-VL-Instruct | VSI-100k | GRPO | T, I, V | Spatial | not release | No | Rule-base |
| R1-Reward | QwenVL-2.5-7B-Instruct | RLAIF-V,VL-Feedback,POVID,WildVision-Battle | StableReinforce(Reinforce++ variant) | T,I,V | Video/Image Understanding,Reward Assessment | OpenRLHF | yes | Rule-base |
| R1-Code-Interpreter | Qwen-2.5-(3B,7B,14B) | SymBench,BIG-Bench-Hard,Reasoning-Gym | GRPO | T | planning | verl | Yes | RM |
| R1-AQA | Qwen2-Audio-7B-Instruct | AVQA | GRPO | T,A | Audio QA | trl | Yes | Rule-base |
| Phi-Omni-ST | - | - | - | - | - | - | - | - |
| Patho-R1 | OpenAI-CLIP/Qwen2.5VL | PubMed+Quilt+PathGen | GRPO+DAPO | T, I | Open-ended/Close-ended VQA | VeRL | Yes | Rule-base |
| GVM-RAFT | Qwen2.5-Math-1.5B and Qwen2.5-Math-7B | Numina-Math | Dynamic RAFT | T | Math | verl | No | Rule-base |
| Omni-R1 (ZJU) | Qwen2.5-Omni-7B | RefAVS,ReVOS,MeViS,refCOCOg | GRPO | T,V,A | Audio-Visual Segmentation(AVS),Reasoning Video Object Segmentation (VOS) | trl | Yes | Rule-base |
| Omni-R1 (MIT) | Qwen2.5-Omni-7B | AVQA-GPT,VGGS-GPT | GRPO | T,A | Audio QA | not release | no | RM |
| MUSEG | Qwen2.5-VL-Instruct | E.T. Instruct 164k + CharadesSTA | GRPO | T, I, V | Video VQA + Video Grounding | trl | No | Rule-base |
| MobileIPL | Qwen2-VL-7B | MobileIPL-dataset | DPO | T,I | GUI automation | no release | yes | Rule-base |
| Mixed-R1 | Qwen2.5-VL-(3B,7B) | Mixed-45K | GRPO | T, I,V | reasoning | no release | Yes | RM + Rule-base |
| Ming-Omni | Ming-Omni | OS-ATLAS, M2E, IM2LATEX-100K, Mini-CASIA-CSDB, CASIA-CSDB, DoTA, ICDAR23-SVRD, AitZ, AitW, GUICourse, OmniMedVQA, SLAKE, VQA-Med, Geometry3K, UniGeo, MAVIS, GeoS, PixMo-count, Geoqa+, GeomVerse, ChemVLM, TGIF-Transition, ShareGPT4Video, videogpt-plus, Llava-video-178k, Video-Vista, Neptune, FunQA, Temp-Compass, EgoTask, InternVid, CLEVRER, VLN-CE, Vript, Cinepile, OpenVid-1M, WenetSpeech, KeSpeech, AliMeeting, AISHELL-1, AISHELL-3, AISHELL-4, CoVoST, CoVoST2, Magicdata, Gigaspeech, Libriheavy, LibriSpeech, SlideSpeech, SPGISpeech, TED-LIUM, Emilla, Multilingual LibriSpeech, Peoples Speech | not release | T,I,V,A | Unified Omni-Modality Perception,Perception and Generation | not release | not release | not release |
| MedVLM-R1 | Qwen2-VL-2B | HuatuoGPT-Vision | GRPO | T, I | Radiological VQA | not release | Yes | Rule-base |
| Med-R1 | Qwen2-VL-2B-Instruct | OmniMedVQA | GRPO | T, I | medical VQA | not release | Yes | Rule-base |
| Lingshu | Qwen2.5-VL-Instruct | 3.75M open-source medical samples and 1.30M synthetic medical samples / MedEvalKit | GRPO | T, I | multimodal QA, text-based QA, and medical report generation | not release | Yes | Rule-base |
| AutoThink | DeepSeek-R1-Distill-Qwen-1.5B | MATH, Minerva, Olympiad, AIME24, AMC23 | GRPO | T | Mathematical Reasoning | verl | No | RM |
| InfiGUI-R1 | Qwen 2.5-VL-3B-Instruct | AndroidControl,ScreenSpot ,ScreenSpot-Pro,Widget-Caption,COCO | RLOO | T,I | GUI automation,GUI grounding | no release | no | Rule-base |
| GUI-R1 | QwenVL 2.5-3B/7B | GUI-R1-3K | GRPO | T,I | GUI automation,GUI grounding | EasyR1 | no | Rule-base |
| GUI-G1 | Qwen2.5‑VL‑3B‑Instruct | UI-BERT and OS-Atlas (17K) | GRPO | T,I | GUI grounding | no release | no | Rule-base |
| GUI-Critic-R1 | Qwen2.5‑VL‑7B‑Instruct | GUI-Critic-Train | GRPO | T,I | GUI Operation Error Detection and Correction | no release | yes | Rule-base |
| GRIT | Qwen2.5-VL-3B and InternVL-3-2B | VSR,TallyQA,GQA,MME,MathVista,OVDEval | GRPO | T,I | explicit visual grounding and multi-step reasoning | Deepspeed Zero2 | No | Rule-base+RM |
| FinLMM-R1 | Qwen2.5-VL-3B | FinData | GRPO | T,I | Reasoning | TAR-LMM | No | RM |
| EchoInk-R1 | Qwen2.5-Omni-7B | AVQA-R1-6K | GRPO | T,I,A | Audio VQA | trl | no | Rule-base |
| DeepVideo-R1 | Qwen2.5-VL-Instruct | SEED-Bench-R1-Train + NExTGQA | GRPO | T, I, V | Video VQA | not release | No | Rule-base |
| Critique-GRPO | Qwen2.5-7B-Base and Qwen3-8B-Base | OpenR1-Math-220k | GRPO | T | mathematical, STEM, and general reasoning | verl | Yes | RM |
| ComfyUI-R1 | Qwen2.5-Coder-7B-Instruct | no release | GRPO | T,I,V | workflow generation | no release | yes | Rule-base |
| ChestX-Reasoner | Qwen2VL-7B | train: MIMIC-CXR+CheXpert+MS-CXR-T+CheXpert+MIMIC-CXR+RSNA+SIIM/eval: RadRBench-CXR | GRPO | T, I | single/binary disease diagnosis | VeRL | Yes | Rule-base |
| AV-Reasoner | Ola-Omni7B | AVQA,Music AVQA,AVE,UnAV,LLP,AVSS-ARIG,DVD-Counting,RepCount | GRPO | T,I,V,A | Counting + Video VQA + (Spatial + Temporal + Grounding) + Reasoning | trl | Yes | Rule-base |
| AudSemThinker | Qwen2.5-Omni-7B | AUDSEM | GRPO | T,A | semantic audio reasoning | trl | No | Rule-base |
| Audio-Reasoner | Qwen2-Audio-7B-Instruct | AVQA | GRPO | T,A | Audio QA | not release | Yes | Rule-base |
| ARPO | UI-Tars-1.5-7B | OS World | GRPO | T,I | GUI automation | VERL | no | Rule-base |
| Ada-R1 | DeepSeek-R1-Distill-Qwen (7B, 1.5B) | GSM8K, MATH, AIME | DPO | T | Math | Bi-Level Preference Training | No | RM |
| ViCrit | Qwen2.5-VL-7B-Instruct,Qwen2.5-VL-72B-Instruct | PixMo-Cap | GRPO | T,I | Hallucination Detection | not release | No | Rule-base |
| Vision Matters | Qwen2.5-VL-Instruct | Geometry3K,TQA,GeoQA,Math8K,M3CoT | GRPO + DPO | T,I | Math | MS-Swift(DPO),EasyR1(GRPO) | No | RM |
| ViGaL | Qwen2.5-VL-7B-Instruct | Sampled from game: Snake(36K), Rotation(36K) | RLOO | T,I | Visual Games | OpenRLHF | No | Rule-base |
| RAP | Qwen2.5-VL-3B, Qwen2.5-VL-7B | MM-Eureka | GRPO, RLOO | T,I | Data Selection | EasyR1 | No | Not mentioned |
| RACRO | Qwen2.5-VL(3B, 7B, 32B) | ViRL39K | CRO | T,I | change reasoner without re-alignment | verl | No | combine |
| ReVisual-R1 | Qwen2.5-VL-7B-Instruct | GRAMMAR | GRPO | T,I | Math | EasyR1 | Yes | Rule-base |
| Rex-Thinker | Qwen2.5-VL-7B | HumanRef-CoT | GRPO | T,I | Object Referring (REC) | verl | Yes | RM |
| ControlThinker | ControlAR | COCOStuff, MultiGen-20M | GRPO | T,I | Image Editing | no release | Yes | RM |
| SynthRL | Qwen2.5-VL-7B-Instruct | MMK12, A-MMK12 | GRPO | T,I | Math | verl | No | RM |
| SRPO | Qwen-2.5-VL-7B, Qwen-2.5-VL-32B | Mulberry dataset (260K), MathV360K, and LLaVA-CoT dataset (100K) , ScienceQA , Geometric Math QA, ChartQA , DVQA, AI2D , MATH, Virgo , R1-OneVision , MMK12, and PhyX | GRPO | T,I | Math | verl | Yes | RM |
| ReasonGen-R1 | Janus-Pro-7B | LAION-5B | GRPO | T,I | Text to Image Generation | verl | Yes | RM |
| MoDoMoDo | Qwen2-VL-2B-Instruct | COCO, LISA, GeoQAV, SAT, ScienceQA | GRPO | T, I | General Visual Reasoning | trl | No | RM |
| DINO-R1 | MM-Grounding-DINO | Objects365 | GRPO | T, I | Object Detection | no release | Yes | RM |
| VisualSphinx | Qwen2.5-VL-7B | VISUALSPHINX | GRPO | T, I | visual logic puzzle, math | verl | No | Rule-base |
| PixelThink | Qwen2.5-VL-7B, SAM2-Large | RefCOCOg | GRPO | T, I | Segmentation | verl | No | Rule-base |
| ViGoRL | Qwen2.5-VL-3B, Qwen2.5-VL-7B | SAT-2, OS-ATLAS, ICAL, Segment Anything | GRPO | T, I | spatial reasoning、web grounding、web action prediction、visual search | verl | Yes | Rule-base |
| Jigsaw-R1 | Qwen2.5-VL-7B, Qwen2.5-VL-3B, Qwen2-VL-2B, InternVL2.5-2B | COCO, CV-Bench, MMVP, SAT, Super-CLEVR | GRPO | T, I | jigsaw puzzles | trl | No | Rule-base |
| UniRL | Show-o, Janus | COCO, GPT4o-Generated | GRPO | T, I | Image Understanding and Generation | no release | Yes | Rule-base |
| cadrille | Qwen2-VL-2B | DeepCAD | DPO, GRPO | T, I | CAD | no release | Yes | Rule-base |
| MM-UPT | Qwen2.5-VL-7B | Geo3K、GeoQA、MMR1 | GRPO | T, I | Math | verl | No | Rule-base |
| RL-with-Cold-Start | Qwen2.5-VL-3B, Qwen2.5-VL-7B | Geometry3K, GeoQA, GeoQA-Plus, Geos, AI2D, TQA, FigureQA, TabMWP, ChartQA, IconQA, Clevr-Math, M3CoT, and ScienceQA | GRPO | T, I | Multimodal Reasoning, especially Math | verl | Yes | Rule-base |
| VRAG-RL | Qwen2.5-VL-3B, Qwen2.5-VL-7B | ViDoSeek, SlideVQA, MMLongBench | GRPO | T, I | Visually Rich Information Understanding | verl | Yes | RM + Rule-base |
| MLRM-Halu | Qwen2.5-VL(3B,7B) | MMMU, MMVP, MMBench, MMStar, MMEval-Pro, VMCBench | GRPO | T,I | reasoning, perception | no release | Yes | Rule-base |
| Active-O3 | Qwen2.5-VL-7B | SODA,LVIS | GRPO | T,I | active perception | no release | Yes | RM |
| RLRF | Qwen2.5-VL(3B,72B),Qwen3-8B | SVG-Stack | GRPO | T,I | Inverse rendering | no release | Yes | RM |
| VisTA | Qwen2.5-VL-7B | ChartQA,Geometry3K | GRPO | T,I | Visual Reasoning,Tool Selection | openR1 | Yes | RM+Rule-base |
| SATORI-R1 | Qwen2.5-VL-Instruct-3B | Text-Total,ICDAR2013,ICDAR2015,CTW1500,COCOText,LSVT,MLT | GRPO | T,I | task-critical regions,answer accuracy | no release | No | RM |
| URSA | Qwen2.5 Math-Instruct , SAM-B+SigLIP-L | DualMath-1.1M | GRPO | T,I | data reasoning,reward hacking | URSA | No | RM |
| v1 | Qwen2-VL(7B,72B),Qwen2.5-VL(7B,72B) | v1g | No | T,I | retrieve regions | - | No | No |
| GRE Suite | Qwen2.5VL(3B,7B,32B) | Im2GPS3k,GWS15k | GRPO | T,I | reasoning location | LLaMA-Factory | Yes | RM+Rule-base |
| V-Triune | Qwen2.5-VL-7B-Instruct,Qwen2.5-VL-32B-Instruct | mm_math,geometry3k,mmk12,PuzzleVQA,AlgoPuzzleVQA,VisualPuzzles, ScienceQA,SciVQA , ViRL39K,ChartQAPro,ChartX,Table-VQA, ViRL39K, V3Det,Object365, 𝐷3, CLEVR, LLaVA-OV Data, EST-VQA | GRPO | T,I | intensive perception | verl | Yes | RM |
| RePrompt | Qwen2.5 7B | GenEva | GRPO | T,I | image generation | trl | Yes | RM |
| GoT-R1 | Qwen2.5VL-7B | JourneyDB-GoT,FLUX-GoT | GRPO | T,I | semantic-spatial reasoning | no release | No | RM |
| SophiaVL-R1 | Qwen2.5-VL-7B-Instruct | SophiaVL-R1-130k | GRPO | T,I | reasoning-specific,general vision-language understanding | VeRL | No | RM+Rule-base |
| R1-ShareVL | Qwen2.5-VL-7B and Qwen2.5-VL-32B | MM-Eureka | GRPO | T,I | General Visual Reasoning | EasyR1 | No | Rule-base |
| VLM-R^3 | Qwen2.5-VL-7B | VLIR | GRPO | T,I | Region Recognition and Reasoning | DeepSpeed | Yes | Rule-base |
| TON | Qwen-2.5-VL-Instruct-3B/7B | CLEVR,Super-CLEVR,GeoQA,AITZ | GRPO | T,I | spanning counting, mobile agent navigation, and mathematical reasoning | vLLM | Yes | Rule-base |
| Pixel Reasoner | Qwen2.5-VL-7B | SA1B,FineWeb and STARQA | GRPO | T,I | pixel-space reasoning | OpenRLHF | No | Rule-base |
| VARD | - | SCOPe,Pick-a-Pic,ImageRewardDB | No | T,I | image generation | not release | No | RM |
| Chain-of-Focus | Qwen2.5-VL-7B | MM-CoF,SA_1B,TextVQA,m3cot,V⋆,POPE | GRPO | T,I | visual search and reasoning | not release | Yes | Rule-base |
| Visionary-R1 | Qwen2.5-VL-3B | A-OKVQA,ChartQA,AI2D,ScienceQA,GeoQA+,DocVQA,CLEVR-Math,Icon-QA,TabMWP,RoBUTSQA,TextVQA | GRPO | T,I | VQA | not release | No | Rule-base |
| VisualQuality-R1 | Qwen2.5-VL-7B | KADID-10K,SPAQ | GRPO | T,I | image quality scoring | not release | No | Rule-base |
| DeepEyes | Qwen2.5-VL-7B | Fine-grained:V∗ training set Chart:ArxivQA Reasoning:ThinkLite-VL | GRPO | T, I | Multimodal Reasoning | verl | No | Rule-base |
| Visual-ARFT | Qwen2.5-VL(3B,7B) | MAT-Search, MAT-Coding,2WikiMultihopQA,HotpotQA,MuSiQue,Bamboogle | GRPO | T, I | Multimodal Agentic Reasoning | no release | No | Rule-base |
| UniVG-R1 | Qwen2-VL-2B 7B | MGrounding-630k,RefCOCO/+/g,RefCOCO,MIG-Bench, LISA-Grounding,LLMSeg-Grounding,ReVOS Grounding,ReasonVOS Grounding | GRPO | T, I,V | Visual Grounding (Multi-image Context, Complex Instructions) | Open-R1 | Yes | RM+Rule-base |
| G1 | Qwen2.5-VL-7B | a batch size of 128 parallel games and a group size of 5 for 500 training steps per game. | GRPO | T, I | Interactive Game Decision-Making | EasyR1 | Yes | Rule-base |
| VisionReasoner | Seg-Zero? | COCO,RefCOCO(+/g) RefCOCO(+/g),ReasonSeg PixMo-Count,CountBench | GRPO | T, I | detection, segmentation, counting | no release | No | Rule-base |
| GuardReasoner-VL | Qwen2.5-VL Instruct 3B and Qwen2.5-VL-Instruct 7B | GuardReasoner-VLTrain | GRPO(omit the KL divergence loss) | T, I | Moderation (Prompt & Response Harmfulness Detection) | EasyR1 | Yes | Rule-base |
| OpenThinkIMG | Qwen2-VL-2B-Instruct | CHARTGEMMA | GRPO | T,I | Chart Reasoning | V-ToolRL (Open-R1) | Yes | Rule-base |
| DanceGRPO | Stable Diffusion,HunyuanVideo,FLUX,SkyReels-I2V | curated prompt dataset,VidProM | GRPO | T,I | Text-to-Video Generation, Image-to-Video Generation,Text-to-ImageGeneration | fastvideo | No | RM |
| Flow-GRPO | SD3.5-M | GenEval,OCR,from pickscore | GRPO | T,I | Composition Image Generation,Visual Text Rendering,Human Preference Alignment | no release | No | RM(pickscore),Rule-base(GenEval,ocr) |
| X-Reasoner | Qwen2.5-VL-7B-Instruct | OpenThoughts,Orz-math,MedQA | GRPO | T,I | Generalization across domains and modalities | no release | No | Rule-base |
| T2I-R1 | Janus-Pro-7B | T2I-CompBench | GRPO | T, I | Text-to-Image Generation | Open-R1 | No | RM |
| VIDEO-RTS | Qwen2.5-VL-7B-Instruct | CG-Bench, 6K MCQA | GRPO | T, V | Video Understanding | TRL | No | Rule-base |
Large Multimodal Reasoning Models (LMRMs) have demonstrated potential in handling complex tasks with long chains of thought. However, their language-centric architectures constrain their effectiveness in real-world scenarios. Specifically, their reliance on vision and language modalities limits their capacity to process and reason over interleaved, diverse data types, while their performance in real-time, iterative interactions with dynamic environments remains underdeveloped. These limitations underscore the need for a new class of models capable of broader multimodal integration and more advanced interactive reasoning.
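To make the intended interactive setting concrete, the sketch below shows the basic observe-reason-act loop that the agentic models in the next table instantiate in different environments (GUI, web, embodied). The `Environment` protocol and `query_vlm` helper are hypothetical placeholders for illustration, not the interface of any listed system.

```python
from typing import Optional, Protocol

class Environment(Protocol):
    """Hypothetical interactive environment (GUI, web page, or embodied simulator)."""
    def observe(self) -> str: ...            # e.g., path to a screenshot or camera frame
    def step(self, action: str) -> bool: ...  # execute an action; return True when the task is done

def query_vlm(prompt: str, image_path: Optional[str] = None) -> str:
    """Hypothetical multimodal model call; plug in any chat-capable LMRM."""
    raise NotImplementedError

def agent_loop(env: Environment, goal: str, max_steps: int = 20) -> None:
    """Observe-reason-act loop: the model replans after every environment step."""
    history: list[str] = []
    for _ in range(max_steps):
        frame = env.observe()
        action = query_vlm(
            f"Goal: {goal}\nPrevious actions: {history}\n"
            "Reason step by step about the current observation, "
            "then output the single next action on the last line.",
            frame,
        )
        history.append(action)
        if env.step(action):
            break
```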
Click to expand N-LMRMs(Agentic Models) table
| Model | Parameter | Input Modality | Output Modality | Training Strategy | Task | Characteristic |
|---|---|---|---|---|---|---|
| R1-Searcher | 7B, 8B | T | T | RL | Multi-Hop QA | RL-Enhanced LLM Search |
| Search-o1 | 32B | T | T | Training-Free | Multi-Hop QA, Math | Agentic Search-Augmented Reasoning |
| DeepResearcher | 7B | T | T | RL | Multi-Hop QA | RL in Live Search Engines |
| Magma | 8B | T, I, V | T | Pretrain | Multimodal Understanding, Spatial Reasoning | 820K Spatial-Verbal Labeled Data |
| OpenVLA | 7B | T, I | T | SFT | Spatial Reasoning | 970k Real-World Robot Demonstrations |
| CogAgent | 18B | T, I | T | Pretrain+SFT | VQA, GUI navigation | Low-High Resolution Encoder Synergy |
| UI-TARS | 2B, 7B, 72B | T, I | T | Pretrain+SFT+RL | VQA, GUI navigation | End-to-End GUI Reasoning and Action |
| Seeclick | 10B | T, I | T | Pretrain+SFT | GUI navigation | Screenshot-Based Task Automation |
| Embodied-Reasoner | 7B | T, I | T, A | Pretrain+SFT | GUI navigation | Image-Text Interleaved Long-Horizon Embodied Reasoning |
| Seed1.5-VL | 20B | T, I, V | T | Pretrain+SFT+RL | GUI, Multimodal Understanding and Reasoning | General-purpose Multimodal Understanding and Reasoning with Iterative Reinforcement Learning |
| RIG | 1.4B (Janus) | T, I | T, A, I | Pretrain + SFT + Imagination Alignment | Minecraft Embodied Tasks, Image Generation, Reasoning | Synergized Reasoning & Imagination, End-to-End Generalist Policy, 17× Sample Efficiency, Lookahead Self-Correction |
Click to expand N-LMRMs(Omni-Modal Models) table
| Model | Parameter | Input Modality | Output Modality | Training Strategy | Task | Characteristic |
|---|---|---|---|---|---|---|
| Gemini 2.0 & 2.5 | / | T, I, A, V | T, I, A | / | / | / |
| GPT-4o | / | T, I, A, V | T, I | / | / | / |
| Megrez-3B-Omni | 3B | T, I, A | T | Pretrain+SFT | VQA, OCR, ASR, Math, Code | Multimodal Encoder-Connector-LLM |
| Qwen2.5-Omni | 7B | T, I, A, V | T, A | Pretrain+SFT | VQA, OCR, ASR, Math, Code | Time-Aligned Multimodal RoPE |
| Baichuan-Omni-1.5 | 7B | T, I, A, V | T, A | Pretrain+SFT | VQA, OCR, ASR, Math, GeneralQA | Leading Medical Image Understanding |
| M2-omni | 9B, 72B | T, I, A, V | T, I, A | Pretrain+SFT | VQA, OCR, ASR, Math, GeneralQA | Step Balance For Pretraining and Adaptive Balance For SFT |
| MiniCPM-o 2.6 | 8B | T, I, A, V | T, A | Pretrain+SFT+RL | VQA, OCR, ASR, AST | Parallel Multimodal Streaming Processing |
| Mini-Omni2 | 0.5B | T, I, A | A | Pretrain+SFT | VQA, ASR, AQA, GeneralQA | Real-Time and End-to-End Voice Response |
| R1-Omni | 0.5B | T, A, V | T | RL | Emotion Recognition | RL with Verifiable Reward |
| Janus-Pro | 1B, 7B | T, I | T, I | Pretrain+SFT | Multimodal Understanding, Text-to-Image | Decoupling Visual Encoding For Understanding and Generation |
| AnyGPT | 7B | T, I, A | T, I, A | Pretrain | Multimodal-to-Text and Text-to-Multimodal | Discrete Representations For Unified Processing |
| Uni-MoE | 13B, 20B, 22B, 37B | T, I, A, V | T | Pretrain+SFT | VQA, AQA | Modality-Specific Encoders with Connectors for Unified Representation |
| Ovis-U1 | 3B | T, I | T, I | Pretrain+SFT | Multimodal understanding, T2I, Image Editing | Unified training from LLM and diffusion decoder with token refiner |
| ShapeLLM-Omni | 7B | 3D, I, T | 3D, T | Pre-trained+SFT | Text-to-3D, Image-to-3D, 3D understanding, interactive 3D editing | Uses 3D VQVAE to tokenize meshes for a unified autoregressive framework |
| Ming-Omni | 2.8B | T, I, A, V | I, T, A | Pretrain+SFT | Multimodal understanding & generation | MoE LLM with modality-specific routers; connects specialized decoders to a frozen perception core. |
| BAGEL | 14B (7B active) | T, I, V | I, T, V | Pretrain+CT+SFT | Multimodal understanding & generation | Unified decoder-only MoT architecture |
Figure 5: Case study of OpenAI o3’s long multimodal chain-of-thought, reaching the correct answer after 8 minutes and 13 seconds of reasoning.
Figure 6: Case study of OpenAI o3: Find locations, solve a puzzle and create multimedia contents.
Figure 7: Case study of OpenAI o3: Visual problem solving and file processing.
Figure 8: Overview of the next-generation native large multimodal reasoning model. The envisioned system aims to achieve comprehensive perception across diverse real-world data modalities, enabling precise omnimodal understanding and in-depth generative reasoning. This foundational model will lead to more advanced forms of intelligent behavior, learning from world experience and realizing lifelong learning and self-improvement.
Figure 9: The outlines of datasets and benchmarks. We reorganize the multimodal datasets and benchmarks into four main categories: Understanding, Generation, Reasoning, and Planning.
Click to expand Datasets and Benchmarks
| Benchmark | Dataset |
|---|---|
| AudioBench, VoiceBench, Fleurs, MusicBench | Librispeech, Common Voice, Aishell, Fleurs |
| AIR-Bench, MMAU, SD-eval, CoVoST2 | MELD, CoVoST2, SIFT-50M, Clotho |
| MusicNet, ACVUBench | AudioCaps, ClothoAQA, MusicNet, NSynth |
| MusicCaps, AVE-PM | |
| Benchmark | Dataset |
|---|---|
| MM-Interleaved, ANOLE | DreamLLM, SEED-Story |
| InterleavedEval, OpenLEAF | NextGPT, DreamFactory |
| OpenING, M2RAG | DreamRunner, EVA |
| SEED-Bench, SEED-Bench-2 | |
| MME-Unify, ChartEdit | |
| RealFactBench, FrontendBench | |
| Benchmark | Dataset |
|---|---|
| WebArena, Mind2Web, VisualWebBench, OSWorld | AMEX, RiCo, WebSRC, E-ANT |
| OmniACT, VisualAgentBench, LlamaTouch, Windows Agent Arena | AndroidEnv, GUI-World, MBE-ARI |
| Ferret-UI, WebShop, SWE-BENCH M, MineDojo | |
| TeamCraft, V-MAGE, BEARCUBS, TongUI | |
| ThinkGeo, MCA-Bench, Agent-RewardBench, AgentBench | |
| RealWebAssist, OSWorld, SPA-Bench | |
In this paper, we survey the evolution of multimodal reasoning models, highlighting pivotal advancements and paradigm-shifting milestones in the field. While current models predominantly adopt a language-centric reasoning paradigm, delivering impressive results in tasks like visual question answering and text-image retrieval, critical challenges persist. Notably, visual-centric long reasoning (e.g., understanding object relations or 3D contexts, and answering visual information-seeking questions) and interactive multimodal reasoning (e.g., dynamic cross-modal dialogue or iterative feedback loops) remain underdeveloped frontiers requiring deeper exploration.
Building on empirical evaluations and experimental insights, we propose a forward-looking framework for inherently multimodal large models that transcend language-dominated architectures. Such models should prioritize three core capabilities:
- Multimodal Agentic Reasoning: Enabling proactive environmental interaction (e.g., embodied AI agents that learn through real-world trial and error)
- Omni-Modal Understanding and Generative Reasoning:
- Integrating any-modal semantics (e.g., aligning abstract concepts across vision, audio, and text) while resolving ambiguities in complex, open-world contexts
- Producing coherent, context-aware outputs across modalities (e.g., generating diagrams from spoken instructions or synthesizing video narratives from text)
By addressing these dimensions, future models could achieve human-like contextual adaptability, bridging the gap between isolated task performance and generalized, real-world problem-solving.
We express sincere gratitude for the valuable contributions of all researchers and students involved in this work.
We welcome the community to contribute to the development of this survey, and we will regularly update it to reflect the latest research.
Please feel free to submit issues or contact us via email at liyunxin987@163.com.