
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Harbin Institute of Technology, Shenzhen
If you like our project, please consider giving us a star ⭐ on GitHub to stay updated with the latest developments.
We welcome recommendations for uncovered work. 🚀 Please suggest additions via issues or email to help us update this repository.

Citation

If you find this work useful for your research, please cite our paper:

@article{li2025perception,
  title={Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models},
  author={Li, Yunxin and Liu, Zhenyu and Li, Zitao and Zhang, Xuanyu and Xu, Zhenran and Chen, Xinyu and Shi, Haoyuan and Jiang, Shenyuan and Wang, Xintong and Wang, Jifang and Huang, Shouzheng and Zhao, Xinping and Jiang, Borui and Hong, Lanqing and Wang, Longyue and Tian, Zhuotao and Huai, Baoxing and Luo, Wenhan and Luo, Weihua and Zhang, Zheng and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2505.04921},
  year={2025}
}

News

🔥 Latest 👉 [2025/08/02] We updated the repository and paper with newly recommended works on multimodal reasoning. You are welcome to recommend your work to us.

🔥 Latest 👉 [2025/07/06] We collected recent multimodal reasoning models and benchmarks (about 150 papers from May to June 2025) in the new version of our paper and repository. You are welcome to recommend your work to us.

🔥 Latest Updates (Click to See More News)

[2025/05/20] 🏮 We have added previously uncovered works (from issues) to the relevant subsections and papers, and we will continue to introduce the newest works. You are welcome to recommend your work to us.

[2025/05/11] 🏮 Exciting news! Our survey was quickly highlighted as the first paper for May 2025 on Hugging Face Daily Papers. Check it out: https://huggingface.co/papers/2505.04921.

[2025/05/09] 🏮 We've analyzed 550+ papers charting the rise of Large Multimodal Reasoning Models (LMRMs). Discover the 4-stage journey from basic modules to advanced MCoT/RL, and our vision of Native LMRMs (capability scope, capability levels, and technical prospects) for comprehensive perception, precise understanding, deep reasoning, and planning.

About

Advances on multimodal reasoning models and a collection of related datasets and benchmarks

Figure 1: The core evolving path of large multimodal reasoning models.

Table of Contents

1 Overview

Figure 2: The roadmap of large multimodal reasoning models.

Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities—aiming to achieve comprehensive perception, precise understanding, and deep reasoning.

As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior.

We present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field’s shifting design philosophies and emerging capabilities.

First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion.

Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains.

Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI o3 and o4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.

2 Roadmap of Multimodal Reasoning Models

2.1 Stage 1 Perception-Driven Reasoning - Developing Task-Specific Reasoning Modules

2.1.1 Modular Reasoning Networks

Click to expand Modular Reasoning Networks table
| Model | Year | Architecture | Highlight | Training Method |
| --- | --- | --- | --- | --- |
| NMN | 2016 | Modular | Dynamically assembles task-specific modules for visual-textual reasoning. | Supervised learning |
| HieCoAtt | 2016 | Attention-based | Aligns question semantics with image regions via hierarchical cross-modal attention. | Supervised learning |
| MCB | 2016 | Bilinear | Optimizes cross-modal feature interactions with efficient bilinear modules. | Supervised learning |
| SANs | 2016 | Attention-based | Iteratively refines reasoning through multiple attention hops over visual features. | Supervised learning |
| DMN | 2016 | Memory-based | Integrates memory modules for multi-episode reasoning over sequential inputs. | Supervised learning |
| ReasonNet | 2017 | Modular | Decomposes reasoning into entity-relation modules for structured inference. | Supervised learning |
| UpDn | 2018 | Attention-based | Combines bottom-up and top-down attention for object-level reasoning. | Supervised learning |
| MAC | 2018 | Memory-based | Uses a memory-augmented control unit for iterative compositional reasoning. | Supervised learning |
| BAN | 2018 | Bilinear | Captures high-order interactions via bilinear attention across modalities. | Supervised learning |
| HeteroMemory | 2019 | Memory-based | Synchronizes appearance and motion modules for video-based temporal reasoning. | Supervised learning |
| MuRel | 2019 | Relational | Models reasoning as a relational network over object pairs for fine-grained inference. | Supervised learning |
| MCAN | 2019 | Attention-based | Employs modular co-attention with self- and guided-attention for deep reasoning. | Supervised learning |

2.1.2 Vision-Language Models-based Modular Reasoning

Click to expand Vision-Language Models table
| Model | Year | Architecture | Highlight | Training Method |
| --- | --- | --- | --- | --- |
| ViLBERT | 2019 | Dual-Encoder | Aligns visual-text features via dual-stream Transformers with cross-modal attention. | Pretraining + fine-tuning |
| LXMERT | 2019 | Dual-Encoder | Enhances cross-modal reasoning with dual-stream pretraining on diverse tasks. | Pretraining + fine-tuning |
| X-LXMERT | 2020 | Dual-Encoder | Extends dual-stream reasoning with generative cross-modal pretraining. | Pretraining + fine-tuning |
| ALBEF | 2021 | Dual-Encoder | Integrates contrastive learning with momentum distillation for robust reasoning. | Contrastive + generative pretraining |
| SimVLM | 2021 | Dual-Encoder | Uses prefix-based pretraining for flexible cross-modal reasoning. | Pretraining + fine-tuning |
| VLMo | 2022 | Dual-Encoder | Employs a mixture-of-modality-experts for dynamic cross-modal reasoning. | Pretraining + fine-tuning |
| METER | 2022 | Dual-Encoder | Enhances reasoning with a modular encoder-decoder for robust alignment. | Pretraining + fine-tuning |
| BLIP | 2022 | Dual-Encoder | Bootstraps alignment with contrastive learning for efficient reasoning. | Contrastive + generative pretraining |
| VisualBERT | 2019 | Single-Transformer-Backbone | Fuses visual-text inputs in a single Transformer for joint contextual reasoning. | Pretraining + fine-tuning |
| VL-BERT | 2019 | Single-Transformer-Backbone | Enhances cross-modal reasoning with unified visual-language pretraining. | Pretraining + fine-tuning |
| UNITER | 2020 | Single-Transformer-Backbone | Reasons via joint contextual encoding in a single Transformer backbone. | Pretraining + fine-tuning |
| PixelBERT | 2020 | Single-Transformer-Backbone | Processes pixels with CNN+Transformer for fine-grained cross-modal reasoning. | Pretraining + fine-tuning |
| UniVL | 2020 | Single-Transformer-Backbone | Unifies video-language reasoning with a single Transformer for temporal tasks. | Pretraining + fine-tuning |
| Oscar | 2020 | Single-Transformer-Backbone | Anchors reasoning with object tags in a unified Transformer for semantic inference. | Pretraining + fine-tuning |
| VinVL | 2021 | Single-Transformer-Backbone | Boosts reasoning with enhanced visual features in a single Transformer. | Pretraining + fine-tuning |
| ERNIE-ViL | 2021 | Single-Transformer-Backbone | Integrates scene graph knowledge for structured visual-language reasoning. | Pretraining + fine-tuning |
| UniT | 2021 | Single-Transformer-Backbone | Streamlines multimodal tasks with a shared self-attention Transformer backbone. | Pretraining + fine-tuning |
| Flamingo | 2022 | Single-Transformer-Backbone | Prioritizes dynamic vision-text interactions via cross-attention. | Pretraining + fine-tuning |
| CoCa | 2022 | Single-Transformer-Backbone | Combines contrastive and generative heads for versatile cross-modal reasoning. | Contrastive + generative pretraining |
| BEiT-3 | 2022 | Single-Transformer-Backbone | Unifies vision-language learning with masked data modeling. | Pretraining + fine-tuning |
| OFA | 2022 | Single-Transformer-Backbone | Provides a unified multimodal framework for efficient cross-modal reasoning. | Pretraining + fine-tuning |
| PaLI | 2022 | Single-Transformer-Backbone | Scales reasoning with a multilingual single-Transformer framework. | Pretraining + fine-tuning |
| BLIP-2 | 2023 | Single-Transformer-Backbone | Uses a querying Transformer for improved cross-modal reasoning efficiency. | Pretraining + fine-tuning |
| Kosmos-1 | 2023 | Single-Transformer-Backbone | Enables interleaved input processing for flexible multimodal understanding. | Pretraining + fine-tuning |
| Kosmos-2 | 2023 | Single-Transformer-Backbone | Enhances grounding capability for precise object localization and reasoning. | Pretraining + fine-tuning |
| CLIPCap | 2021 | Vision-Encoder-LLM | Projects CLIP visual features into an LLM for reasoning and captioning. | Fine-tuning |
| LLaVA | 2023 | Vision-Encoder-LLM | Tunes ViT-LLM integration for conversational multimodal reasoning. | Instruction tuning |
| MiniGPT-4 | 2023 | Vision-Encoder-LLM | Aligns ViT to a frozen LLM via projection for streamlined reasoning. | Fine-tuning |
| InstructBLIP | 2023 | Vision-Encoder-LLM | Uses instruction tuning to align ViT with LLM for multimodal reasoning. | Instruction tuning |
| Qwen-VL | 2023 | Vision-Encoder-LLM | Incorporates spatial-aware ViT for enhanced grounded reasoning. | Pretraining + fine-tuning |
| mPLUG-Owl | 2023 | Vision-Encoder-LLM | Integrates modular visual encoder with LLM for instruction-following reasoning. | Instruction tuning |
| Otter | 2023 | Vision-Encoder-LLM | Combines modular visual encoder with LLM for in-context multimodal reasoning. | Instruction tuning |

2.2 Stage 2 Language-Centric Short Reasoning - System-1 Reasoning

With the advent of large-scale multimodal pretraining, MLLMs have started to demonstrate emergent reasoning capabilities. However, such inferences are often shallow, relying primarily on implicit correlations rather than explicit logical processes. To mitigate this limitation, MCoT has emerged as a simple yet effective approach. By incorporating intermediate reasoning steps, MCoT improves cross-modal alignment, knowledge integration, and contextual grounding, all without the need for extensive supervision or significant architectural modifications. In this stage, we categorize existing approaches into three paradigms: prompt-based MCoT, structural reasoning with predefined patterns, and tool-augmented reasoning leveraging lightweight external modules.
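As a concrete illustration of the prompt-based MCoT paradigm, the sketch below assembles a zero-shot multimodal chain-of-thought request by pairing an image with a step-by-step instruction and then extracting the final answer from the generated rationale. It is a minimal, model-agnostic sketch: the message schema, the `query`-style helper names, and the `Answer:` convention are assumptions for illustration, not the interface of any particular model surveyed here.

```python
import base64


def build_mcot_messages(image_bytes: bytes, question: str) -> list[dict]:
    """Build a zero-shot multimodal chain-of-thought prompt.

    The message layout mimics common chat-completion APIs, but the exact
    schema is a placeholder; adapt it to whichever MLLM you call.
    """
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {"role": "system",
         "content": "You are a careful visual reasoner. Think step by step, "
                    "then give the final answer on a line starting with 'Answer:'."},
        {"role": "user",
         "content": [
             {"type": "image", "data": image_b64},                      # visual evidence
             {"type": "text", "data": f"Question: {question}\n"
                                      "Let's think step by step."},     # CoT trigger
         ]},
    ]


def extract_answer(model_output: str) -> str:
    """Pull the final answer out of the generated reasoning chain."""
    for line in reversed(model_output.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return model_output.strip()  # fall back to the whole rationale


if __name__ == "__main__":
    messages = build_mcot_messages(b"<raw image bytes>", "How many people are wearing hats?")
    print(messages[1]["content"][1]["data"])
    print(extract_answer("The image shows three people.\nTwo wear hats.\nAnswer: 2"))
```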

Figure 3: Taxonomy and representative methods of structural reasoning in multimodal chain-of-thought.

2.2.1 Prompt-based MCoT

2.2.2 Structural Reasoning

Click to expand Structural Reasoning table
| Name | Modality | Task | Reasoning Structure | Datasets | Highlight |
| --- | --- | --- | --- | --- | --- |
| Cantor | T,I | VQA | Perception, Decision | - | Decouples perception and reasoning via feature extraction and CoT-style integration. |
| TextCoT | T,I | VQA | Caption, Localization, Precise observation | - | First summarizes visual context, then generates CoT-based responses. |
| Grounding-Prompter | T,V,A | Temporal Sentence Grounding | Denoising | VidChapters-7M | Performs global parsing, denoising, and partitioning before reasoning. |
| Audio-CoT | T,A | AQA | Manual-CoT, Zero-Shot-CoT, Desp-CoT | - | Enhances audio reasoning by utilizing three chain-of-thought paradigms. |
| VIC | I,T | VQA | Thinking before looking | - | Breaks tasks into text-based sub-steps before integrating visual inputs to form final rationales. |
| Visual Sketchpad | I,T | VQA, math QA | Sketch-based reasoning paradigm | - | Organizes rationales into "Thought, Action, Observation" phases. |
| Det-CoT | I,T | VQA | Subtask decomposition, Execution, Verification | - | Formalizes VQA reasoning as a combination of subtasks and reviews. |
| BDoG | I,T | VQA | Entity update, Relation update, Graph pruning | - | Utilizes a dedicated debate-summarization pipeline with specialized agents. |
| CoTDet | I,T | Object detection | Object listing, Affordance analysis, Visual feature summarization | COCO-Tasks | Achieves object detection via a human-like procedure of listing, analyzing, and summarizing. |
| CoCoT | I,T | VQA | Contrastive prompting strategy | - | Systematically contrasts input similarities and differences. |
| TeSO | T,A,V | Temporal Sentence Grounding | Visual summary, Sound filtering, Denoising | Youtube-8M, Semantic-ADE20K | Robustly localizes sounding objects in the visual space through global understanding, sounding-object filtering, and noise removal. |
| Emma-X | I,T | Robotic task | Grounded CoT reasoning, Look-ahead spatial reasoning | Dataset based on BridgeV2 | Integrates grounded planning and predictive (look-ahead) spatial reasoning. |
| DDCoT | T,I | VQA | Question Deconstruct, Rationale | ScienceQA | Maintains a critical attitude by identifying reasoning and recognition responsibilities through the combined effect of negative-space design and visual deconstruction. |
| AVQA-CoT | T,A,V | AVQA | Question Deconstruct, Question Selection, Rationale | MUSIC-AVQA | Decomposes complex questions into multiple simpler sub-questions and leverages LLMs to select relevant sub-questions for audio-visual question answering. |
| CoT-PT | T,I | Image Classification, Image-Text Retrieval, VQA | Coarse-to-Fine Image Concept Representation | ImageNet | First to successfully adapt CoT for prompt tuning by combining visual and textual embeddings in the vision domain. |
| IoT | T,I | VQA | Visual Action Selection, Execution, Rationale, Summary, Self-Refine | - | Enhances visual reasoning by integrating visual and textual rationales through a model-driven multimodal reasoning chain. |
| Shikra | T,I | VQA, PointQA | Caption, Object Grounding | ScienceQA | Maintains a critical attitude by identifying reasoning and recognition responsibilities through the combined effect of negative-space design and visual deconstruction. |
| E-CoT | T,I,A | Policies' Generalization | Task Rephrase, Planning, Task Deconstruct, Object Grounding | BridgeData V2 | Integrates semantic planning with low-level perceptual and motor reasoning, advancing task formulations in embodied intelligence. |
| CoS | T,I | VQA | Object Grounding, Rationale | Llava665K | Guides the model to identify and focus on key image regions relevant to a question, enabling multi-granularity understanding without compromising resolution. |
| TextCoT | T,I | VQA | Caption, Object Grounding, Image Zoom | Llava665K, ShareGPT4V | Enables accurate and interpretable multimodal question answering through staged processing: overview, coarse localization, and fine-grained observation. |
| DCoT | T,I | VQA | Object Grounding, Fine-Grained Image Generation, Similar Example Retrieve, Rationale | - | Uses a dual-guidance mechanism, combining bounding-box cues to focus attention on relevant image regions and retrieving the most suitable examples from a curated demonstration cluster as contextual support. |

2.2.3 Externally Augmented Reasoning

Click to expand Externally Augmented Reasoning table
| Name | Modality | Task | Enhancement Type | External Source | Highlight |
| --- | --- | --- | --- | --- | --- |
| MM-ToT | T,I | Image Generation | Search Algorithm | DFS, BFS | Applies DFS and BFS to select optimal outputs. |
| HoT | T,I | VQA | Search Algorithm | Multi-hop random walks on a graph | Generates linked thoughts from multimodal data in a hyperedge. |
| AGoT | T,I | Text-Image Retrieval, VQA | Search Algorithm | Prompt aggregation and prompt flow operations | Builds a graph to aggregate multi-faceted reasoning with visuals. |
| BDoG | T,I | VQA | Search Algorithm | Graph condensation: entity update, relation update, graph pruning | Effective three-agent debate forms a thought graph for multimodal queries. |
| L3GO | T,I | 3D Object Generation & Composition | Tools | Blender, ControlNet | Iterative part-based 3D construction through LLM reasoning in a simulation environment. |
| HDRA | T,I | Knowledge-QA, Visual Grounding | Tools | RL agent controller, Visual Foundation Models | RL agent controls multi-stage visual reasoning through dynamic instruction selection. |
| Det-CoT | T,I | Object detection | Tools | Visual Processing Prompts | Visual prompts guide MLLM attention for structured detection reasoning. |
| Chain-of-Image | T,I | Geometric, chess & commonsense reasoning | Tools | Chain of Images prompting | Generates intermediate images during reasoning for visual pattern recognition. |
| AnyMAL | T,I,A,V | Cross-modal reasoning, multimodal QA | Tools | Pre-trained alignment module | Efficient integration of diverse modalities; strong reasoning via a LLaMA-2 backend. |
| SE-CMRN | T,I | Visual Commonsense Reasoning | Tools | Syntactic Graph Convolutional Network | Enhances language-guided visual reasoning via a syntactic GCN in a dual-branch network. |
| RAGAR | T,I | Political Fact-Checking | RAG | DuckDuckGo & SerpAPI | Integrates MLLMs with retrieval-augmented reasoning to verify facts using text and image evidence. |
| Chain-of-action | T,I | Info retrieval | RAG | Google Search, ChromaDB | Decomposes questions into reasoning chains with configurable retrieval actions to resolve conflicts between knowledge sources. |
| KAM-CoT | T,I,KG | Educational science reasoning | RAG | ConceptNet knowledge graph | Enhances reasoning by retrieving structured knowledge from graphs and integrating it through two-stage training. |
| AR-MCTS | T,I | Multi-step reasoning | RAG | Contriever, CLIP dual-stream | Step-wise retrieval with Monte Carlo Tree Search for verified reasoning. |
| MR-MKG | T,I | General multimodal reasoning | RAG | RGAT | Enhances multimodal reasoning by integrating information from multimodal knowledge graphs. |
| Reverse-HP | T,I | Disease-related reasoning | RAG | Reverse hyperplane projection | Utilizes KG embeddings to enhance reasoning for specific diseases with multimodal data. |
| MarT | T,I | Analogical reasoning | RAG | Structure-guided relation transfer | Uses structure-mapping theory and relation-oriented transfer for analogical reasoning with a KG. |
| MCoT-Memory | T,I | VQA | Multimodal Information Enhancing | LLaVA | Memory framework and scene-graph construction for effective long-horizon task planning. |
| MGCoT | T,I | VQA | Multimodal Embedding Enhancing | ViT-large encoder | Precise visual feature extraction aiding multimodal reasoning. |
| CCoT | T,I | VQA | Multimodal Perception Enhancing | Scene Graphs | Utilizes the generated scene graph as an intermediate reasoning step. |
| CVR-LLM | T,I | VQA | Multimodal Embedding Enhancing | BLIP2flant5 & BLIP2 multi-embedding | Precise context-aware image descriptions through iterative self-refinement and effective integration of textual and multimodal factors. |
| TeSO | T,V,A | Temporal Sentence Grounding (TSG) | Multimodal Information Enhancing | VGGish | Integrates text semantics to mitigate segmentation preference for better audio-visual correlation, boosting AVS performance. |
| CAT | T,I | Image Captioning | Multimodal Perception Enhancing | SAM | Integrates pre-trained image caption generators, SAM, and instruction-tuned large language models. |
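The tool- and retrieval-augmented methods above share a common loop: decompose the query, call an external module (search engine, detector, scene graph, knowledge graph) for evidence, and fold that evidence back into the reasoning chain. A minimal sketch of this loop follows; the `retrieve` backend and its toy index are hypothetical placeholders for whatever external source a given method uses.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str   # e.g. "web_search", "scene_graph", "knowledge_graph"
    content: str


def retrieve(query: str) -> list[Evidence]:
    """Hypothetical external module; a real system would call a search API,
    an object detector, or a knowledge-graph lookup here."""
    toy_index = {
        "capital of france": Evidence("web_search", "Paris is the capital of France."),
    }
    hit = toy_index.get(query.lower())
    return [hit] if hit else []


def externally_augmented_answer(question: str, max_steps: int = 3) -> str:
    """Interleave reasoning steps with retrieval, in the spirit of RAG-style MCoT."""
    chain: list[str] = [f"Question: {question}"]
    for step in range(max_steps):
        evidence = retrieve(question)
        if evidence:
            chain.append(f"Step {step + 1}: retrieved {evidence[0].content!r} "
                         f"from {evidence[0].source}")
            chain.append("Final step: the evidence directly answers the question.")
            break
        chain.append(f"Step {step + 1}: no external evidence found, refining the query.")
    return "\n".join(chain)


if __name__ == "__main__":
    print(externally_augmented_answer("capital of France"))
```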

2.3 Stage 3 Language-Centric Long Reasoning - System-2 Thinking and Planning

While structural reasoning introduces predefined patterns to guide MLLMs toward more systematic reasoning, it remains constrained by shallow reasoning depth and limited adaptability. To handle more complex multimodal tasks, recent work aims to develop System-2-style reasoning (Kahneman, 2011). Unlike fast and reactive strategies, this form of reasoning is deliberate, compositional, and guided by explicit planning. By extending reasoning chains, grounding them in multimodal inputs, and training with supervised or reinforcement signals, these models begin to exhibit long-horizon reasoning and adaptive problem decomposition.
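Many of the R1-style models collected in Section 2.3.3 pair such long reasoning chains with simple verifiable (rule-based) rewards instead of a learned reward model. As a hedged illustration of that idea, and not the exact recipe of any listed paper, the sketch below scores a rollout by matching the content of an `<answer>` tag against the ground truth and adds a small bonus for following a `<think>`/`<answer>` template.

```python
import re


def format_reward(rollout: str) -> float:
    """Small bonus when the rollout follows the expected <think>/<answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 0.1 if re.match(pattern, rollout.strip(), flags=re.DOTALL) else 0.0


def accuracy_reward(rollout: str, ground_truth: str) -> float:
    """1.0 when the extracted answer matches the reference after light normalization."""
    match = re.search(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    if match is None:
        return 0.0
    prediction = match.group(1).strip().lower()
    return 1.0 if prediction == ground_truth.strip().lower() else 0.0


def rule_based_reward(rollout: str, ground_truth: str) -> float:
    """Total verifiable reward used as the RL training signal."""
    return accuracy_reward(rollout, ground_truth) + format_reward(rollout)


if __name__ == "__main__":
    sample = "<think>The chart peaks in 2021, so the answer is 2021.</think><answer>2021</answer>"
    print(rule_based_reward(sample, "2021"))  # 1.1
```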

Figure 4: Timeline (top) and core components (bottom) of recent multimodal O1-like and R1-like models.

2.3.1 Cross-Modal Reasoning

Click to expand Cross-Modal Reasoning table
| Name | Modality | Cross-Modal Reasoning | Task | Highlight |
| --- | --- | --- | --- | --- |
| IdealGPT | T, I | Answers sub-questions about the image via GPT | VQA, Text Entailment | Uses GPT to iteratively decompose and solve visual reasoning tasks |
| AssistGPT | T, I, V | Plan, Execute, Inspect via external tools (GPT-4, OCR, grounding, etc.) | VQA, Causal Reasoning | Uses an interleaved code-and-language reasoning approach to handle complex multimodal tasks |
| ProViQ | T, V | Generates and executes Python programs for the video | Video VQA | Uses procedural programs to solve visual subtasks in videos |
| MM-REACT | T, I, V | Uses CV tools for sub-tasks about the image | VQA, Video VQA | Vision experts combined with GPT for multimodal reasoning and action |
| VisualReasoner | T, I | Synthesizes multi-step reasoning data (using external CV tools) | GQA, VQA | Proposes a least-to-most visual reasoning paradigm and a data synthesis approach for training |
| Multi-model-thought | T, I | External tools (Visual Sketchpad) | Geometry, Math, VQA | Investigates inference-time scaling for multi-modal thought across diverse tasks |
| FaST | T, I | System-switch adapter for visual reasoning | VQA | Integrates fast and slow thinking mechanisms into visual agents |
| ICoT | T, I | Generates interleaved visual-textual reasoning via ADS | VQA | Uses visual patches as reasoning carriers to improve LMMs' fine-grained reasoning |
| Image-of-Thought | T, I | Extracts visual rationales step-by-step via IoT prompting | VQA | Uses visual rationales to enhance LLMs' reasoning accuracy and interpretability |
| CoTDiffusion | T, I | External algorithms | Robotics | Generates subgoal images before action to enhance reasoning in long-horizon robot manipulation tasks |
| T-SciQ | T, I | Model-intrinsic capabilities | ScienceQA | Uses LLM-generated reasoning signals to teach multimodal reasoning for complex science QA |
| Visual-CoT | T, I | Model-intrinsic capabilities | VQA, DocQA, ChartQA | Uses visual-text pairs as reasoning carriers to bridge logical gaps in sequential data |
| VoCoT | T, I | Model-intrinsic capabilities | VQA | Uses visually grounded, object-centric reasoning paths for multi-step reasoning |
| MVoT | T, I | Model-intrinsic capabilities | Spatial Reasoning | Uses multimodal reasoning with image visualizations to enhance complex spatial reasoning in LMMs |

2.3.2 MM-O1

Click to expand MM-O1 table
| Name | Backbone | Dataset | Modality | Reasoning Paradigm | Task Type | Highlight |
| --- | --- | --- | --- | --- | --- | --- |
| Marco-o1 | Qwen2-7B-Instruct | Open-O1 CoT + Marco-o1 CoT + Marco-o1 Instruction | T | MCTS-guided Thinking | Math, Translation | MCTS for solution expansion and reasoning action strategy |
| LLaMA-Berry | LLaMA-3.1-8B | PRM800K + OpenMathInstruct-1 | T | MCTS-guided Thinking | Math | SR-MCTS for search and PPRM for evaluation |
| RBF++ | LLaMA3-8B-Instruct | GSM8K, SVAMP, MATH | T | SR-MCTS (Structured and Recursive MCTS) + PPRM | Math | Proposes SR-MCTS for structured search and PPRM for evaluating reasoning boundaries |
| LLaVA-CoT | Llama-3.2V-11B-cot | LLaVA-CoT-100k | T, I | Summary, Caption, Thinking | Science, General | Introduces LLaVA-CoT-100k and scalable beam search |
| LlamaV-o1 | Llama-3.2V-11B-cot | LLaVA-CoT-100k + PixMo | T, I | Summary, Caption, Thinking | Science, General | Introduces VCR-Bench and outperforms prior baselines |
| Mulberry | Llama-3.2V-11B-cot, LLaVA-Next-8B, Qwen2-VL-7B | Mulberry-260K | T, I | Caption, Rationales, Thinking | Math, General | Introduces Mulberry-260K and CoMCTS for collective learning |
| RedStar-Geo | InternVL2-8B | GeoQA | T, I | Long-Thinking | Math | Competitive with minimal Long-CoT data |
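Several MM-O1-style systems above rely on search over intermediate reasoning stages (MCTS-guided thinking, stage-level beam search). The sketch below shows a simplified stage-level beam search in that spirit; the four stage names, the `generate` callback, and the `score` function are illustrative assumptions, with toy stand-ins for an MLLM and its verifier.

```python
import random
from typing import Callable

STAGES = ["summary", "caption", "reasoning", "conclusion"]  # example stage decomposition


def stage_beam_search(
    generate: Callable[[str, str], str],   # (stage, prefix) -> candidate stage text
    score: Callable[[str], float],         # higher is better
    beam_width: int = 2,
    samples_per_stage: int = 4,
) -> str:
    """Stage-level search: sample several candidates per reasoning stage,
    keep the top `beam_width` prefixes, and expand them stage by stage."""
    beams = [""]
    for stage in STAGES:
        candidates = []
        for prefix in beams:
            for _ in range(samples_per_stage):
                extension = generate(stage, prefix)
                candidates.append(prefix + f"[{stage}] {extension}\n")
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beams[0]


if __name__ == "__main__":
    random.seed(0)
    # Toy generator and scorer standing in for an MLLM and its verifier.
    toy_generate = lambda stage, prefix: f"{stage} text v{random.randint(1, 9)}"
    toy_score = lambda text: len(text) + random.random()
    print(stage_beam_search(toy_generate, toy_score))
```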

2.3.3 MM-R1

Click to expand MM-R1 table
Approach Backbone Dataset RL Algorithm Modality Task Type RL Framework Cold Start Rule-base/RM
RLHF-V LLaVA-13B RLHF-V-Dataset(1.4k) DPO T, I VQA Muffin - (unknown)
InternVL2.5 InternVL MMPR(3m) MPO(DPO) T, I VQA - - (unknown)
Insight-V LLaMA3-LLaVA-Next - DPO T, I VQA trl - (unknown)
LLaVA-Reasoner-DPO LLaMA3-LLaVA-Next ShareGPT4o-reasoning-dpo(6.6k) DPO T, I VQA trl - (unknown)
VLM-R1 Qwen2.5-VL coco , LISA , Refcoco GRPO T, I Grounding ,Math , Open-Vocabulary Detection trl No Rule-base
R1-V Qwen2-VL CLEVR , GEOQA GRPO T, I Counting , Math trl No Rule-base
MM-EUREKA InternVL2.5 K12 , MMPR RLOO T, I Math OpenRLHF Yes Rule-base
MM-EUREKA-Qwen Qwen2.5-VL K12 , MMPR GRPO T, I Math OpenRLHF No Rule-base
Video-R1 Qwen2.5-VL Video-R1(260K) GRPO T, I, V Video VQA trl Yes Rule-base
LMM-R1 Qwen2.5-VL VerMulti PPO T, I Math OpenRLHF No RM
Vision-R1 Qwen2.5-VL LLaVA-CoT , Mulberry GRPO T, I Math - Yes Rule-base
Visual-RFT Qwen2-VL coco , LISA , ... GRPO T, I Detection , Classification trl No Rule-base
STAR-R1 Qwen2.5-VL-7B TRANCE(13.5k) GRPO T, I Spatial Reasoning (Transformation) vLLM No Rule-base
VL-Rethinker Qwen2.5-VL MathVista, MathVerse, MathVision, MMMU-Pro, EMMA, MEGA GRPO+SSR T, I Mathematical, Scientific, Real-world Reasoning trl No Rule-base
Reason-RFT Qwen2.5-VL CLEVR-Math, Super-CLEVR, GeoMath, Geometry3K, TRANCE GRPO T, I Counting, Structure Perception, Spatial Transformation trl No Rule-base
R1-OneVision Qwen2.5-VL R1-Onevision-Dataset GRPO T, I Math , Science , General , Doc - Yes Rule-base
Seg-Zero Qwen2.5-VL , SAM2 RefCOCOg , ReasonSeg GRPO T, I Grounding verl No Rule-base
VisualThinker-R1-Zero Qwen2-VL SAT dataset GRPO T, I Spatial Reasoning trl No Rule-base
R1-Omni HumanOmni MAFW , DFEW GRPO T, I, A, V emotion recognition trl Yes Rule-base
OThink-MR1 Qwen2.5-VL CLEVR , GEOQA GRPO T, I Counting , Math - No Rule-base
Multimodal-Open-R1 Qwen2-VL multimodal-open-r1-8k-verified(based on Math360K and Geo170K) GRPO T,I Math trl No Rule-base
Curr-ReFT Qwen2.5-VL RefCOCOg , Math360K , Geo170K GRPO T,I Detection , Classification , Math Curr-RL No RM
Open-R1-Video Qwen2-VL open-r1-video-4k GRPO T, I, V Video VQA trl No Rule-base
VisRL Qwen2.5-VL VisCoT DPO T,I VQA trl Yes RM
R1-VL Qwen2-VL Mulberry-260k StepGRPO T,I Math , ChartQA not release No Rule-base
WEBAGENT-R1 Qwen2.5-3B/Llama3.1-8B WebArena-Lite M-GRPO T web tasks no release Yes RM
WavReward Qwen2.5-Omni-7B-Think ChatReward-30K PPO T,A end-to-end dialogue not release No Rule-base
VPRL LVM-3B FrozenLake, Maze, MiniBehavior GRPO I Visual Spatial Planning no release Yes Rule-base
VideoChat-R1 Qwen2.5-VL-Instruct Charade - STA + NExTGQA + FIBER-1k + VidTAB GRPO T, I, V Video Grounding + Video VQA trl No Rule-base
VerIPO Qwen2.5-VL-Instruct DAPO-Math + ViRL39K + VQA-Video-24K GRPO + DPO T, I, V Video VQA + Spatial OpenRLHF No Rule-base
VAU-R1 Qwen2.5-VL-Instruct VAU-Bench-Train GRPO T, I, V Anomaly Understanding+ Video VQA + Video Grounding trl No Rule-base
UnifiedReward-Think UnifiedReward HPD(25.6K),EvalMuse(3K),OpenAI-4o_t2i_human_preference (6.7K),VideoDPO (10K),Text2Video-Human Preferences (5.7K),ShareGPTVideo-DPO (17K) GRPO T,I,V Video/Image Understanding,Reward Assessment trl yes Rule-base
UIShift Qwen2.5‑VL‑3B‑Instruct,Qwen2.5‑VL‑7B‑Instruct no release GRPO T,I GUI automation,GUI grounding VLM-R1 no Rule-base
UI-R1 Qwen2.5-VL-3B ScreenSpot(mobile subset),AndroidControl(1K) GRPO T,I GUI Action Prediction,GUI grounding no release no Rule-base
TW-GRPO Qwen2.5-VL-Instruct CLEVRER dataset GRPO T, I, V Video VQA trl No Rule-base
TinyLLaVA-Video-R1 Qwen2.5-VL-Instruct NextQA GRPO T, I, V Video VQA trl Yes Rule-base
Time-R1 Qwen2.5-VL-Instruct YT-Temporal + DiDeMo + QuerYD + InternVid + HowTo100M + VTG-IT + TimeIT + TimePro + HTStep + LongVid GRPO T, I, V Video Grounding trl Yes Rule-base
Spatial-MLLM Qwen2.5-VL-Instruct Spatial-MLLM-120k GRPO T, I, V Spatial not release Yes Rule-base
SpaceR Qwen2.5-VL-Instruct SpaceR-151k GRPO T, I, V Spatial + VideoVQA trl No Rule-base
SoundMind Qwen2.5-Omni-7B Audio Logical Reasoning(ALR) REINFORCE++ T,A Audio text bimodal reasoning VeRL No Rule-base
Skywork-VL Reward Qwen2.5-VL-7B-Instruct LLaVA-Critic-113k,Skywork-Reward-Preference-80Kv0.2,RLAIF-V-Dataset MPO T,I VQA,Math,Science,Reasoning not release no Rule-base
ShapeLLM-Omni Qwen-2.5-VL-Instruct-7B 3D-Alpaca Not explicitly mentioned (uses autoregressive models) T,I,3D 3D Generation, 3D Understanding, 3D Editing Not directly stated (uses supervised fine-tuning and autoregressive training) No Rule-base
GRPO-CARE Qwen2.5-VL-Instruct SEED-Bench-R1-Train GRPO T, I, V Video VQA + Spatial trl No Rule-base
SARI Qwen2-Audio-7B-Instruct/ Qwen2.5-Omni AudioSet+MusicBench+Meld+AVQA GRPO T,A Audio QA trl No Rule-base
Router-R1 Qwen2.5-3B-Instruct , LLaMA-3.2-3B-Instruct Natural Questions, TriviaQA, PopQA; HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle PPO T Multi-hop Question Answering verl Yes RM + Rule-base
RM-R1 Qwen-Instruct (7B/14B/32B), DeepSeek-Distilled-Qwen (7B/14B/32B) Skywork-Reward-Preference, Code-Preference-Pairs, Math-DPO-10K GRPO T Reward Modeling verl Yes RM
LoVeC Llama-3-8B-Instruct and Gemma-2-9B-It WildHallu,Bios,PopQA GRPO,DPO,and ORPO T long-form generation TRL/vLLM No Rule-base+RM
ReFoCUS LLaVA-OV / InternVL ReFoCUS-962K GRPO T, I, V Video VQA not release No RM
ReCode Qwen-2.5-Coder-7B-Instruct and DeepSeekv1.5-Coder-7B-Instruct construct own training dataset GRPO,DAPO T code generation not release No Rule-base
R1-Zero-VSI Qwen2-VL-Instruct VSI-100k GRPO T, I, V Spatial not release No Rule-base
R1-Reward QwenVL-2.5-7B-Instruct RLAIF-V,VL-Feedback,POVID,WildVision-Battle StableReinforce(Reinforce++ variant) T,I,V Video/Image Understanding,Reward Assessment OpenRLHF yes Rule-base
R1-Code-Interpreter Qwen-2.5-(3B,7B,14B) SymBench,BIG-Bench-Hard,Reasoning-Gym GRPO T planning verl Yes RM
R1-AQA Qwen2-Audio-7B-Instruct AVQA GRPO T,A Audio QA trl Yes Rule-base
Phi-Omni-ST - - - - - - - -
Patho-R1 OpenAI-CLIP/Qwen2.5VL PubMed+Quilt+PathGen GRPO+DAPO T, I Open-ended/Close-ended VQA VeRL Yes Rule-base
GVM-RAFT Qwen2.5-Math-1.5B and Qwen2.5-Math-7B Numina-Math Dynamic RAFT T Math verl No Rule-base
Omni-R1 (ZJU) Qwen2.5-Omni-7B RefAVS,ReVOS,MeViS,refCOCOg GRPO T,V,A Audio-Visual Segmentation(AVS),Reasoning Video Object Segmentation (VOS) trl Yes Rule-base
Omni-R1 (MIT) Qwen2.5-Omni-7B AVQA-GPT,VGGS-GPT GRPO T,A Audio QA not release no RM
MUSEG Qwen2.5-VL-Instruct E.T. Instruct 164k + CharadesSTA GRPO T, I, V Video VQA + Video Grounding trl No Rule-base
MobileIPL Qwen2-VL-7B MobileIPL-dataset DPO T,I GUI automation no release yes Rule-base
Mixed-R1 Qwen2.5-VL-(3B,7B) Mixed-45K GRPO T, I,V reasoning no release Yes RM + Rule-base
Ming-Omni Ming-Omni OS-ATLAS, M2E, IM2LATEX-100K, Mini-CASIA-CSDB, CASIA-CSDB, DoTA, ICDAR23-SVRD, AitZ, AitW, GUICourse, OmniMedVQA, SLAKE, VQA-Med, Geometry3K, UniGeo, MAVIS, GeoS, PixMo-count, Geoqa+, GeomVerse, ChemVLM, TGIF-Transition, ShareGPT4Video, videogpt-plus, Llava-video-178k, Video-Vista, Neptune, FunQA, Temp-Compass, EgoTask, InternVid, CLEVRER, VLN-CE, Vript, Cinepile, OpenVid-1M, WenetSpeech, KeSpeech, AliMeeting, AISHELL-1, AISHELL-3, AISHELL-4, CoVoST, CoVoST2, Magicdata, Gigaspeech, Libriheavy, LibriSpeech, SlideSpeech, SPGISpeech, TED-LIUM, Emilla, Multilingual LibriSpeech, Peoples Speech not release T,I,V,A Unified Omni-Modality Perception,Perception and Generation not release not release not release
MedVLM-R1 Qwen2-VL-2B HuatuoGPT-Vision GRPO T, I Radiological VQA not release Yes Rule-base
Med-R1 Qwen2-VL-2B-Instruct OmniMedVQA GRPO T, I medical VQA not release Yes Rule-base
Lingshu Qwen2.5-VL-Instruct 3.75M open-source medical samples and 1.30M synthetic medical samples / MedEvalKit GRPO T, I multimodal QA, text-based QA, and medical report generation not release Yes Rule-base
AutoThink DeepSeek-R1-Distill-Qwen-1.5B MATH, Minerva, Olympiad, AIME24, AMC23 GRPO T Mathematical Reasoning verl No RM
InfiGUI-R1 Qwen 2.5-VL-3B-Instruct AndroidControl,ScreenSpot ,ScreenSpot-Pro,Widget-Caption,COCO RLOO T,I GUI automation,GUI grounding no release no Rule-base
GUI-R1 QwenVL 2.5-3B/7B GUI-R1-3K GRPO T,I GUI automation,GUI grounding EasyR1 no Rule-base
GUI-G1 Qwen2.5‑VL‑3B‑Instruct UI-BERT and OS-Atlas (17K) GRPO T,I GUI grounding no release no Rule-base
GUI-Critic-R1 Qwen2.5‑VL‑7B‑Instruct GUI-Critic-Train GRPO T,I GUI Operation Error Detection and Correction no release yes Rule-base
GRIT Qwen2.5-VL-3B and InternVL-3-2B VSR,TallyQA,GQA,MME,MathVista,OVDEval GRPO T,I explicit visual grounding and multi-step reasoning Deepspeed Zero2 No Rule-base+RM
FinLMM-R1 Qwen2.5-VL-3B FinData GRPO T,I Reasoning TAR-LMM No RM
EchoInk-R1 Qwen2.5-Omni-7B AVQA-R1-6K GRPO T,I,A Audio VQA trl no Rule-base
DeepVideo-R1 Qwen2.5-VL-Instruct SEED-Bench-R1-Train + NExTGQA GRPO T, I, V Video VQA not release No Rule-base
Critique-GRPO Qwen2.5-7B-Base and Qwen3-8B-Base OpenR1-Math-220k GRPO T mathematical, STEM, and general reasoning verl Yes RM
ComfyUI-R1 Qwen2.5-Coder-7B-Instruct no release GRPO T,I,V workflow generation no release yes Rule-base
ChestX-Reasoner Qwen2VL-7B train: MIMIC-CXR+CheXpert+MS-CXR-T+CheXpert+MIMIC-CXR+RSNA+SIIM/eval: RadRBench-CXR GRPO T, I single/binary disease diagnosis VeRL Yes Rule-base
AV-Reasoner Ola-Omni7B AVQA,Music AVQA,AVE,UnAV,LLP,AVSS-ARIG,DVD-Counting,RepCount GRPO T,I,V,A Counting + Video VQA + (Spatial + Temporal + Grounding) + Reasoning trl Yes Rule-base
AudSemThinker Qwen2.5-Omni-7B AUDSEM GRPO T,A semantic audio reasoning trl No Rule-base
Audio-Reasoner Qwen2-Audio-7B-Instruct AVQA GRPO T,A Audio QA not release Yes Rule-base
ARPO UI-Tars-1.5-7B OS World GRPO T,I GUI automation VERL no Rule-base
Ada-R1 DeepSeek-R1-Distill-Qwen (7B, 1.5B) GSM8K, MATH, AIME DPO T Math Bi-Level Preference Training No RM
ViCrit Qwen2.5-VL-7B-Instruct,Qwen2.5-VL-72B-Instruct PixMo-Cap GRPO T,I Hallucination Detection not release No Rule-base
Vision Matters Qwen2.5-VL-Instruct Geometry3K,TQA,GeoQA,Math8K,M3CoT GRPO + DPO T,I Math MS-Swift(DPO),EasyR1(GRPO) No RM
ViGaL Qwen2.5-VL-7B-Instruct Sampled from game: Snake(36K), Rotation(36K) RLOO T,I Visual Games OpenRLHF No Rule-base
RAP Qwen2.5-VL-3B,Qwen2.5-VL-7B MM-Eureka GRPO, RLOO T,I Data Selection EasyR1 No Not mentioned
RACRO Qwen2.5-VL(3B, 7B, 32B) ViRL39K CRO T,I change reasoner without re-alignment verl No combine
ReVisual-R1 Qwen2.5-VL-7B-Instruct GRAMMAR GRPO T,I Math EasyR1 Yes Rule-base
Rex-Thinker Qwen2.5-VL-7B HumanRef-CoT GRPO T,I Object Referring (REC) verl Yes RM
ControlThinker ControlAR COCOStuff, MultiGen-20M GRPO T,I Image Editing no release Yes RM
SynthRL Qwen2.5-VL-7B-Instruct MMK12, A-MMK12 GRPO T,I Math verl No RM
SRPO Qwen-2.5-VL-7B, Qwen-2.5-VL-32B Mulberry dataset (260K), MathV360K, and LLaVA-CoT dataset (100K) , ScienceQA , Geometric Math QA, ChartQA , DVQA, AI2D , MATH, Virgo , R1-OneVision , MMK12, and PhyX GRPO T,I Math verl Yes RM
ReasonGen-R1 Janus-Pro-7B LAION-5B GRPO T,I Text to Image Generation verl Yes RM
MoDoMoDo Qwen2-VL-2B-Instruct COCO, LISA, GeoQAV, SAT, ScienceQA GRPO T, I General Visual Reasoning trl No RM
DINO-R1 MM-Grounding-DINO Objects365 GRPO T, I Object Detection no release Yes RM
VisualSphinx Qwen2.5-VL-7B VISUALSPHINX GRPO T, I visual logic puzzle, math verl No Rule-base
PixelThink Qwen2.5-VL-7B, SAM2-Large RefCOCOg GRPO T, I Segmentation verl No Rule-base
ViGoRL Qwen2.5-VL-3B, Qwen2.5-VL-7B SAT-2, OS-ATLAS, ICAL, Segment Anything GRPO T, I spatial reasoning、web grounding、web action prediction、visual search verl Yes Rule-base
Jigsaw-R1 Qwen2.5-VL-7B, Qwen2.5-VL-3B, Qwen2-VL-2B, InternVL2.5-2B COCO, CV-Bench, MMVP, SAT, Super-CLEVR GRPO T, I jigsaw puzzles trl No Rule-base
UniRL Show-o, Janus COCO, GPT4o-Generated GRPO T, I Image Understanding and Generation no release Yes Rule-base
cadrille Qwen2-VL-2B DeepCAD DPO, GRPO T, I CAD no release Yes Rule-base
MM-UPT Qwen2.5-VL-7B Geo3K、GeoQA、MMR1 GRPO T, I Math verl No Rule-base
RL-with-Cold-Start Qwen2.5-VL-3B, Qwen2.5-VL-7B Geometry3K, GeoQA, GeoQA-Plus, Geos, AI2D, TQA, FigureQA, TabMWP, ChartQA, IconQA, Clevr-Math, M3CoT, and ScienceQA GRPO T, I Multimodal Reasoning, especially Math verl Yes Rule-base
VRAG-RL Qwen2.5-VL-3B, Qwen2.5-VL-7B ViDoSeek, SlideVQA, MMLongBench GRPO T, I Visually Rich Information Understanding verl Yes RM + Rule-base
MLRM-Halu Qwen2.5-VL(3B,7B) MMMU, MMVP, MMBench, MMStar, MMEval-Pro, VMCBench GRPO T,I reasoning, perception no release Yes Rule-base
Active-O3 Qwen2.5-VL-7B SODA,LVIS GRPO T,I active perception no release Yes RM
RLRF Qwen2.5-VL(3B,72B),Qwen3-8B SVG-Stack GRPO T,I Inverse rendering no release Yes RM
VisTA Qwen2.5-VL-7B ChartQA,Geometry3K GRPO T,I Visual Reasoning,Tool Selection openR1 Yes RM+Rule-base
SATORI-R1 Qwen2.5-VL-Instruct-3B Text-Total,ICDAR2013,ICDAR2015,CTW1500,COCOText,LSVT,MLT GRPO T,I task-critical regions,answer accuracy no release No RM
URSA Qwen2.5 Math-Instruct , SAM-B+SigLIP-L DualMath-1.1M GRPO T,I data reasoning,reward hacking URSA No RM
v1 Qwen2-VL(7B,72B),Qwen2.5-VL(7B,72B) v1g No T,I retrieve regions - No No
GRE Suite Qwen2.5VL(3B,7B,32B) Im2GPS3k,GWS15k GRPO T,I reasoning location LLaMA-Factory Yes RM+Rule-base
V-Triune Qwen2.5-VL-7B-Instruct,Qwen2.5-VL-32B-Instruct mm_math,geometry3k,mmk12,PuzzleVQA,AlgoPuzzleVQA,VisualPuzzles, ScienceQA,SciVQA , ViRL39K,ChartQAPro,ChartX,Table-VQA, ViRL39K, V3Det,Object365, 𝐷3, CLEVR, LLaVA-OV Data, EST-VQA GRPO T,I intensive perception verl Yes RM
RePrompt Qwen2.5 7B GenEva GRPO T,I image generation trl Yes RM
GoT-R1 Qwen2.5VL-7B JourneyDB-GoT,FLUX-GoT GRPO T,I semantic-spatial reasoning no release No RM
SophiaVL-R1 Qwen2.5-VL-7B-Instruct SophiaVL-R1-130k GRPO T,I reasoning-specific,general vision-language understanding VeRL No RM+Rule-base
R1-ShareVL Qwen2.5-VL-7B and Qwen2.5-VL-32B MM-Eureka GRPO T,I General Visual Reasoning EasyR1 No Rule-base
VLM-R^3 Qwen2.5-VL-7B VLIR GRPO T,I Region Recognition and Reasoning DeepSpeed Yes Rule-base
TON Qwen-2.5-VL-Instruct-3B/7B CLEVR,Super-CLEVR,GeoQA,AITZ GRPO T,I spanning counting, mobile agent navigation, and mathematical reasoning vLLM Yes Rule-base
Pixel Reasoner Qwen2.5-VL-7B SA1B,FineWeb and STARQA GRPO T,I pixel-space reasoning OpenRLHF No Rule-base
VARD - SCOPe,Pick-a-Pic,ImageRewardDB No T,I image generation not release No RM
Chain-of-Focus Qwen2.5-VL-7B MM-CoF,SA_1B,TextVQA,m3cot,V⋆,POPE GRPO T,I visual search and reasoning not release Yes Rule-base
Visionary-R1 Qwen2.5-VL-3B A-OKVQA,ChartQA,AI2D,ScienceQA,GeoQA+,DocVQA,CLEVR-Math,Icon-QA,TabMWP,RoBUTSQA,TextVQA GRPO T,I VQA not release No Rule-base
VisualQuality-R1 Qwen2.5-VL-7B KADID-10K,SPAQ GRPO T,I image quality scoring not release No Rule-base
DeepEyes Qwen2.5-VL-7B Fine-grained:V∗ training set Chart:ArxivQA Reasoning:ThinkLite-VL GRPO T, I Multimodal Reasoning verl No Rule-base
Visual-ARFT Qwen2.5-VL(3B,7B) MAT-Search, MAT-Coding,2WikiMultihopQA,HotpotQA,MuSiQue,Bamboogle GRPO T, I Multimodal Agentic Reasoning no release No Rule-base
UniVG-R1 Qwen2-VL-2B 7B MGrounding-630k,RefCOCO/+/g,RefCOCO,MIG-Bench, LISA-Grounding,LLMSeg-Grounding,ReVOS Grounding,ReasonVOS Grounding GRPO T, I,V Visual Grounding (Multi-image Context, Complex Instructions) Open-R1 Yes RM+Rule-base
G1 Qwen2.5-VL-7B a batch size of 128 parallel games and a group size of 5 for 500 training steps per game. GRPO T, I Interactive Game Decision-Making EasyR1 Yes Rule-base
VisionReasoner Seg-Zero? COCO,RefCOCO(+/g) RefCOCO(+/g),ReasonSeg PixMo-Count,CountBench GRPO T, I detection, segmentation, counting no release No Rule-base
GuardReasoner-VL Qwen2.5-VL Instruct 3B and Qwen2.5-VL-Instruct 7B GuardReasoner-VLTrain GRPO(omit the KL divergence loss) T, I Moderation (Prompt & Response Harmfulness Detection) EasyR1 Yes Rule-base
OpenThinkIMG Qwen2-VL-2B-Instruct CHARTGEMMA GRPO T,I Chart Reasoning V-TOOL RL?Open-R1 Yes Rule-base
DanceGRPO Stable Diffusion,HunyuanVideo,FLUX,SkyReels-I2V curated prompt dataset,VidProM GRPO T,I Text-to-Video Generation, Image-to-Video Generation,Text-to-ImageGeneration fastvideo No RM
Flow-GRPO SD3.5-M GenEval,OCR,from pickscore GRPO T,I Composition Image Generation,Visual Text Rendering,Human Preference Alignment no release No RM(pickscore),Rule-base(GenEval,ocr)
X-Reasoner Qwen2.5-VL-7B-Instruct OpenThoughts,Orz-math,MedQA GRPO T,I Generalization across domains and modalities no release No Rule-base
T2I-R1 Janus-Pro-7B T2I-CompBench GRPO T, I Text-to-Image Generation Open-R1 No RM
VIDEO-RTS Qwen2.5-VL-7B-Instruct CG-Bench, 6K MCQA GRPO T, V Video Understanding TRL No Rule-base
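GRPO dominates the RL Algorithm column above. Under its usual formulation, each prompt is answered by a group of rollouts whose rewards are normalized within the group, so no learned value function is required. The sketch below shows only this group-relative advantage step, as a simplified view rather than any single paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Compute GRPO-style advantages for one group of rollouts sampled from the
    same prompt: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Eight rollouts for one multimodal question, scored by a rule-based reward.
    rewards = [1.0, 0.0, 0.0, 1.1, 0.0, 1.0, 0.0, 0.0]
    print([round(a, 3) for a in group_relative_advantages(rewards)])
```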

3 Towards Native Multimodal Reasoning Models

Large Multimodal Reasoning Models (LMRMs) have demonstrated potential in handling complex tasks with long chain-of-thought. However, their language-centric architectures constrain their effectiveness in real-world scenarios. Specifically, their reliance on vision and language modalities limits their capacity to process and reason over diverse interleaved data types, while their performance in real-time, iterative interactions with dynamic environments remains underdeveloped. These limitations underscore the need for a new class of models capable of broader multimodal integration and more advanced interactive reasoning.
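To make the interactive-reasoning requirement concrete, the sketch below spells out the generic perceive-reason-act loop that the agentic systems in the following table instantiate (GUI navigation, embodied control, live search). The `Environment` class and `policy` function are hypothetical stand-ins, not interfaces of any listed model.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical interactive environment (GUI, embodied simulator, web search, ...)."""
    goal: str
    steps_left: int = 3

    def observe(self) -> dict:
        return {"screenshot": "<image placeholder>", "goal": self.goal}

    def act(self, action: str) -> bool:
        """Apply an action; return True when the episode should stop."""
        self.steps_left -= 1
        return action == "stop" or self.steps_left == 0


def policy(observation: dict, history: list[str]) -> str:
    """Stand-in for an N-LMRM: map multimodal observations plus memory to an action."""
    return "stop" if history else f"click_element_related_to({observation['goal']!r})"


def run_episode(env: Environment) -> list[str]:
    history: list[str] = []
    done = False
    while not done:
        obs = env.observe()            # perceive
        action = policy(obs, history)  # reason over observation and memory
        done = env.act(action)         # act on the environment
        history.append(action)
    return history


if __name__ == "__main__":
    print(run_episode(Environment(goal="open the settings page")))
```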

Click to expand N-LMRMs (Agentic Models) table

| Model | Parameter | Input Modality | Output Modality | Training Strategy | Task | Characteristic |
| --- | --- | --- | --- | --- | --- | --- |
| R1-Searcher | 7B, 8B | T | T | RL | Multi-Hop QA | RL-Enhanced LLM Search |
| Search-o1 | 32B | T | T | Training-Free | Multi-Hop QA, Math | Agentic Search-Augmented Reasoning |
| DeepResearcher | 7B | T | T | RL | Multi-Hop QA | RL in Live Search Engines |
| Magma | 8B | T, I, V | T | Pretrain | Multimodal Understanding, Spatial Reasoning | 820K Spatial-Verbal Labeled Data |
| OpenVLA | 7B | T, I | T | SFT | Spatial Reasoning | 970K Real-World Robot Demonstrations |
| CogAgent | 18B | T, I | T | Pretrain+SFT | VQA, GUI navigation | Low-High Resolution Encoder Synergy |
| UI-TARS | 2B, 7B, 72B | T, I | T | Pretrain+SFT+RL | VQA, GUI navigation | End-to-End GUI Reasoning and Action |
| Seeclick | 10B | T, I | T | Pretrain+SFT | GUI navigation | Screenshot-Based Task Automation |
| Embodied-Reasoner | 7B | T, I | T, A | Pretrain+SFT | GUI navigation | Image-Text Interleaved Long-Horizon Embodied Reasoning |
| Seed1.5-VL | 20B | T, I, V | T | Pretrain+SFT+RL | GUI, Multimodal Understanding and Reasoning | General-purpose Multimodal Understanding and Reasoning with Iterative Reinforcement Learning |
| RIG | 1.4B (Janus) | T, I | T, A, I | Pretrain+SFT+Imagination Alignment | Minecraft Embodied Tasks, Image Generation, Reasoning | Synergized Reasoning & Imagination, End-to-End Generalist Policy, 17× Sample Efficiency, Lookahead Self-Correction |
Click to expand N-LMRMs (Omni-Modal Models) table

| Model | Parameter | Input Modality | Output Modality | Training Strategy | Task | Characteristic |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini 2.0 & 2.5 | / | T, I, A, V | T, I, A | / | / | / |
| GPT-4o | / | T, I, A, V | T, I | / | / | / |
| Megrez-3B-Omni | 3B | T, I, A | T | Pretrain+SFT | VQA, OCR, ASR, Math, Code | Multimodal Encoder-Connector-LLM |
| Qwen2.5-Omni | 7B | T, I, A, V | T, A | Pretrain+SFT | VQA, OCR, ASR, Math, Code | Time-Aligned Multimodal RoPE |
| Baichuan-Omni-1.5 | 7B | T, I, A, V | T, A | Pretrain+SFT | VQA, OCR, ASR, Math, GeneralQA | Leading Medical Image Understanding |
| M2-omni | 9B, 72B | T, I, A, V | T, I, A | Pretrain+SFT | VQA, OCR, ASR, Math, GeneralQA | Step Balance for Pretraining and Adaptive Balance for SFT |
| MiniCPM-o 2.6 | 8B | T, I, A, V | T, A | Pretrain+SFT+RL | VQA, OCR, ASR, AST | Parallel Multimodal Streaming Processing |
| Mini-Omni2 | 0.5B | T, I, A | A | Pretrain+SFT | VQA, ASR, AQA, GeneralQA | Real-Time and End-to-End Voice Response |
| R1-Omni | 0.5B | T, A, V | T | RL | Emotion Recognition | RL with Verifiable Reward |
| Janus-Pro | 1B, 7B | T, I | T, I | Pretrain+SFT | Multimodal Understanding, Text-to-Image | Decoupled Visual Encoding for Understanding and Generation |
| AnyGPT | 7B | T, I, A | T, I, A | Pretrain | Multimodal-to-Text and Text-to-Multimodal | Discrete Representations for Unified Processing |
| Uni-MoE | 13B, 20B, 22B, 37B | T, I, A, V | T | Pretrain+SFT | VQA, AQA | Modality-Specific Encoders with Connectors for Unified Representation |
| Ovis-U1 | 3B | T, I | T, I | Pretrain+SFT | Multimodal Understanding, T2I, Image Editing | Unified training from LLM and diffusion decoder with token refiner |
| ShapeLLM-Omni | 7B | 3D, I, T | 3D, T | Pretrain+SFT | Text-to-3D, Image-to-3D, 3D Understanding, Interactive 3D Editing | Uses a 3D VQVAE to tokenize meshes for a unified autoregressive framework |
| Ming-Omni | 2.8B | T, I, A, V | I, T, A | Pretrain+SFT | Multimodal Understanding & Generation | MoE LLM with modality-specific routers; connects specialized decoders to a frozen perception core |
| BAGEL | 14B (7B active) | T, I, V | I, T, V | Pretrain+CT+SFT | Multimodal Understanding & Generation | Unified decoder-only MoT architecture |

3.1 Evaluation of O3 and O4-mini

Figure 5: Case study of OpenAI o3's long multimodal chain-of-thought, reaching the correct answer after 8 minutes and 13 seconds of reasoning.
Figure 6: Case study of OpenAI o3: finding locations, solving a puzzle, and creating multimedia content.
Figure 7: Case study of OpenAI o3: visual problem solving and file processing.

3.2 Model Capability

3.3 Technical Prospect

Figure 8: Overview of next-generation native large multimodal reasoning model. The envisioned system aims to achieve comprehensive perception across diverse real-world data modalities, enabling precise omnimodal understanding and in-depth generative reasoning. This foundational model will lead to more advanced forms of intelligent behavior, learning from world experience and realizing lifelong learning and self-improvement.

4 Dataset and Benchmark

Figure 9: An outline of the datasets and benchmarks. We reorganize the multimodal datasets and benchmarks into four main categories: Understanding, Generation, Reasoning, and Planning.

Click to expand Datasets and Benchmarks

4.1 Multimodal Understanding

4.1.1 Visual-Centric Understanding

| Benchmark | Dataset |
| --- | --- |
| VQA, GQA, DocVQA, TextVQA | ALIGN, LTIP, YFCC100M, DocVQA |
| OCR-VQA, CMMLU, C-Eval, MTVQA | Visual Genome, YouTube8M, CC3M, ActivityNet-QA |
| Perception-Test, Video-MMMU, Video-MME, MMBench | SBU-Caption, AI2D, LAION-5B, LAION-400M |
| Seed-Bench, MME-RealWorld, MMMU, MM-Vet | MS-COCO, Vript, OpenVid-1M, VidGen-1M |
| MMT-Bench, Hallu-PI, ColorBench, DVQA | Flickr30k, COYO-700M, WebVid, Youku-mPLUG |
| MMStar, TRIG-Bench, MM-IFEval, All-Angles Bench | VideoCC3M, FILIP, CLIP, TikTalkCoref |
| Wukong, 4D-Bench, DVBench, EIBench | EarthScape, MRES-32M |
| FAVOR-Bench, H2VU-Benchmark, HIS-Bench, IV-Bench | |
| MMCR-Bench, MMSciBench, PM4Bench, ProBench | |
| Chart-HQA, CliME, DomainCQA | |
| FlowVerse, Kaleidoscope, MAGIC-VQA, MME-Unify | |
| MMLA, Misleading ChartQA, NoTeS-Bank | |
| OWLViz, RISEBench, RSMMVP, RefCOCOm | |
| SARLANG-1M, SBVQA, STI-Bench, TDBench | |
| V2P-Bench, VidDiffBench, Video-MMLU, ColorBench | |
| VideoComp, VideoVista-CulturalLingo, VisNumBench, WikiVideo | |
| XLRS-Bench, AgMMU, CausalVQA, FedVLMBench | |
| SeriesBench, WebUIBench, MLLM-CL, MERIT | |
| A4Bench, VLM@school, UnLOK-VQA, DocMark | |
| KnowRecall and VisRecall, EmotionHallucer | |

4.1.2 Audio-Centric Understanding

4.2 Multimodal Generation

4.2.1 Cross-modal Generation

4.2.2 Joint Multimodal Generation

4.3 Multimodal Reasoning

4.3.1 General Visual Reasoning

4.3.2 Domain-specific Reasoning

4.4 Multimodal Planning

4.4.1 GUI Navigation

4.4.2 Embodied and Simulated Environments

4.5 Evaluation Method

5 Conclusion

In this paper, we survey the evolution of multimodal reasoning models, highlighting pivotal advancements and paradigm-shifting milestones in the field. While current models predominantly adopt a language-centric reasoning paradigm—delivering impressive results in tasks like visual question answering and text-image retrieval—critical challenges persist. Notably, visual-centric long reasoning (e.g., understanding object relations or 3D contexts, addressing visual information seeking questions) and interactive multimodal reasoning (e.g., dynamic cross-modal dialogue or iterative feedback loops) remain underdeveloped frontiers requiring deeper exploration.

Building on empirical evaluations and experimental insights, we propose a forward-looking framework for inherently multimodal large models that transcend language-dominated architectures. Such models should prioritize three core capabilities:

  1. Multimodal Agentic Reasoning: Enabling proactive environmental interaction (e.g., embodied AI agents that learn through real-world trial and error)
  2. Omni-Modal Understanding and Generative Reasoning:
    • Integrating any-modal semantics (e.g., aligning abstract concepts across vision, audio, and text) while resolving ambiguities in complex, open-world contexts
    • Producing coherent, context-aware outputs across modalities (e.g., generating diagrams from spoken instructions or synthesizing video narratives from text)

By addressing these dimensions, future models could achieve human-like contextual adaptability, bridging the gap between isolated task performance and generalized, real-world problem-solving.

Acknowledgements

We express sincere gratitude for the valuable contributions of all researchers and students involved in this work.

We welcome the community to contribute to the development of this survey, and we will regularly update it to reflect the latest research.

Please feel free to submit issues or contact us via email at liyunxin987@163.com.

