A curated, taxonomy-driven collection of language-conditioned robot manipulation papers, code, simulators, and benchmarks – tracking the literature behind
“Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation” (arXiv:2312.10807).
If you find this repo useful, please:
- ⭐ Star the repo
- 👀 Watch for updates
- 🧑‍💻 Open a PR to add missing papers or fixes
so more people can discover and build on this survey!
- [November 20, 2025] The survey paper is further extended with a new structure (a taxonomy of language roles) and more recent work (2024–2025).
- [November 30, 2024] Extended survey paper is available.
- [October 02, 2024] Cutting-edge papers from 2024 are available.
- Survey Paper
- Taxonomy: How Language Bridges Perception and Control
- Language for State Evaluation
- Language as Policy Conditions
- Language for Cognitive Planning and Reasoning
- Comparative Analysis
- Citation
This repository is built around the survey:
Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation
Hongkuan Zhou, Xiangtong Yao, Oier Mees, Yuan Meng, Ted Xiao, Yonatan Bisk, Jean Oh, Edward Johns, Mohit Shridhar, Dhruv Shah, Jesse Thomason, Kai Huang, Joyce Chai, Zhenshan Bing, Alois Knoll
The survey organizes methods by the role language plays in the system.
At a high level, language can:
- Evaluate what the robot is doing
  -> Language for state evaluation
  Language becomes a reward, cost, or scoring function, used to measure task progress, preferences, or goal satisfaction.
- Specify how the robot should act
  -> Language as a policy condition (Sec. 5)
  Language is fed directly into the policy, shaping the action distribution at each step (e.g., language-conditioned RL, BC, diffusion policies).
- Help the robot think and plan
  -> Language for cognitive planning and reasoning (Sec. 6)
  Language is used as an internal reasoning medium: planning, decomposition, querying knowledge bases, or manipulating symbolic structures.
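To make the three roles concrete, here is a minimal toy sketch in Python. It is not taken from the survey or from any listed paper: the dictionary state, the `put <object> in <location>` instruction format, and all function names are assumptions chosen purely for illustration. Real systems replace each rule with a learned model or an LLM.

```python
from typing import Dict, List, Tuple

# Hypothetical toy state: object name -> location name.
State = Dict[str, str]


def parse_goal(instruction: str) -> Tuple[str, str]:
    """Toy parser for instructions of the form 'put <object> in <location>'."""
    words = instruction.lower().split()
    return words[1], words[3]


def language_reward(instruction: str, state: State) -> float:
    """Role 1, state evaluation: language defines a score over states."""
    obj, loc = parse_goal(instruction)
    return 1.0 if state.get(obj) == loc else 0.0


def language_conditioned_policy(instruction: str, state: State) -> str:
    """Role 2, policy condition: the instruction is an input to the policy
    alongside the state, and the policy returns the next action."""
    obj, loc = parse_goal(instruction)
    if state.get(obj) == loc:
        return "noop"
    if state.get(obj) == "gripper":
        return f"place {obj} {loc}"
    return f"pick {obj}"


def language_planner(instruction: str) -> List[str]:
    """Role 3, planning and reasoning: a long instruction is decomposed
    into sub-instructions that a low-level policy can follow."""
    return [step.strip() for step in instruction.split(" then ")]


if __name__ == "__main__":
    plan = language_planner("put apple in bowl then put cup in sink")
    print(plan)
    print(language_conditioned_policy(plan[0], {"apple": "table"}))
    print(language_reward(plan[0], {"apple": "bowl"}))
```

In this sketch the same instruction string is consumed in three different ways: as a scoring function over states, as an extra input to the action selector, and as something to be decomposed before any action is chosen.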
Below, we briefly summarize each role and show how it maps to the sections in this repo.
- Zero-Shot Reward Specification via Grounded Natural Language [paper]
- Trajectory Improvement and Reward Learning from Comparative Language Feedback [paper][code]
- PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards [paper]
- Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization [paper]
- From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following [paper]
- Grounding English Commands to Reward Functions [paper]
- Model-Based Inverse Reinforcement Learning from Visual Demonstrations [paper]
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [paper][code]
- Reward Design with Language Models [paper] [code]
- Language Reward Modulation for Pretraining Reinforcement Learning [paper] [code]
- Language to Rewards for Robotic Skill Synthesis [paper] [code]
- RoboGen: towards unleashing infinite data for automated robot learning via generative simulation [paper] [code]
- Text2Reward: Reward Shaping with Language Models for Reinforcement Learning [paper] [code]
- Eureka: Human-Level Reward Design via Coding Large Language Models [paper] [code]
- Learning reward for robot skills using large language models via self-alignment [paper] [code]
- R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models [paper]
- RLingua: Improving Reinforcement Learning Sample Efficiency in Robotic Manipulations With Large Language Models [paper]
- ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics [paper]
- Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics [paper] [code]
- Guiding reinforcement learning with shaping rewards provided by the vision–language model [paper]
- ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations [paper][code]
- Correcting Robot Plans with Natural Language Feedback [paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper] [code]
- Language-Conditioned Path Planning [paper] [code]
- IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models [paper] [code]
- Language-Conditioned Goal Generation: a New Approach to Language Grounding for RL [paper]
- LanCon-Learn: Learning With Language to Enable Generalization in Multi-Task Manipulation [paper]
- Meta-Reinforcement Learning via Language Instructions [paper] [code]
- Learning from Symmetry: Meta-Reinforcement Learning with Symmetrical Behaviors and Language Instructions [paper] [code]
- Natural Language Instruction-following with Task-related Language Development and Translation [paper]
- Task-Oriented Language Grounding for Robot via Learning Object Mask [paper]
- Preserving and combining knowledge in robotic lifelong reinforcement learning [paper]
- FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning [paper] [code]
- Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance [paper] [code]
- LIMT: Language-Informed Multi-Task Visual World Models [paper]
- Pay Attention! - Robustifying a Deep Visuomotor Policy Through Task-Focused Visual Attention [paper]
- Language-Conditioned Imitation Learning for Robot Manipulation Tasks [paper]
- CLIPORT: What and Where Pathways for Robotic Manipulation [paper] [code]
- Language Conditioned Imitation Learning over Unstructured Data [paper] [code]
- BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning [paper] [code]
- MimicPlay: Long-Horizon Imitation Learning by Watching Human Play [paper] [code]
- Instruction-driven history-aware policies for robotic manipulations [paper] [code]
- PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation [paper] [code]
- Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [paper] [code]
- GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields [paper] [code]
- RVT: Robotic View Transformer for 3D Object Manipulation [paper] [code]
- Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation [paper] [code]
- What Matters in Language Conditioned Robotic Imitation Learning Over Unstructured Data [paper] [code]
- Grounding Language with Visual Affordances over Unstructured Data [paper] [code]
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware [paper] [code]
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [paper] [code]
- SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning [paper]
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion [paper] [code]
- Imitating Human Behaviour with Diffusion Models [paper] [code]
- Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects [paper]
- Octo: An Open-Source Generalist Robot Policy [paper] [code]
- Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics [paper]
- The Ingredients for Robotic Diffusion Transformers [paper] [code]
- ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Prediction for Robotic Manipulation [paper] [code]
- DNAct: Diffusion Guided Multi-Task 3D Policy Learning [paper]
- Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning [paper]
- 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [paper] [code]
- Diffusion Co-Policy for Synergistic Human-Robot Collaborative Tasks [paper]
- Inference-Time Policy Steering Through Human Interactions [paper] [code]
- Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach [paper] [code]
- Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [paper] [code]
- Generative Skill Chaining: Long-Horizon Skill Planning with Diffusion Models [paper] [code]
- Goal-Conditioned Imitation Learning using Score-based Diffusion Policies [paper] [code]
- PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play [paper]
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition [paper] [code]
- Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals [paper] [code]
- DISCO: Language-Guided Manipulation With Diffusion Policies and Constrained Inpainting [paper]
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [paper] [code]
- Language Control Diffusion: Efficiently Generalizing through Space, Time, and Tasks [paper] [code]
- Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning [paper]
- Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation [paper]
- StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects [paper] [code]
- PoCo: Policy Composition from and for Heterogeneous Robot Learning [paper]
- RoLD: Robot Latent Diffusion for Multi-task Policy Modeling [paper] [code]
- GR-MG: Leveraging Partially-Annotated Data via Multi-Modal Goal-Conditioned Policy [paper] [code]
- Hierarchical understanding in robotic manipulation: A knowledge-based framework [paper]
- Semantic Grasping Via a Knowledge Graph of Robotic Manipulation: A Graph Representation Learning Approach [paper]
- Knowledge Acquisition and Completion for Long-Term Human-Robot Interactions using Knowledge Graph Embedding [paper]
- Tell me dave: Context-sensitive grounding of natural language to manipulation instructions [paper]
- Neuro-symbolic procedural planning with commonsense prompting [paper]
- Reinforcement Learning Based Navigation with Semantic Knowledge of Indoor Environments [paper]
- Learning Neuro-Symbolic Skills for Bilevel Planning [paper]
- Learning Neuro-symbolic Programs for Language Guided Robot Manipulation [paper] [code]
- Long-term robot manipulation task planning with scene graph and semantic knowledge [paper]
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition [paper] [code]
- ProgPrompt: program generation for situated robot task planning using large language models [paper]
- Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback [paper] [code]
- Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills [paper] [code]
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [paper] [code]
- Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [paper] [code]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [paper] [code]
- Embodied Task Planning with Large Language Models [paper] [code]
- Text2Motion: from natural language instructions to feasible plans [paper]
- AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [paper]
- Learning to reason over scene graphs: a case study of finetuning GPT-2 into a robot language model for grounded task planning [paper]
- SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning [paper]
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition [paper] [code]
- Inner Monologue: Embodied Reasoning through Planning with Language Models [paper]
- Language Models as Zero-Shot Trajectory Generators [paper] [code]
- SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models [paper] [code]
- Human–robot interaction through joint robot planning with large language models [paper]
- Rearrangement: A Challenge for Embodied AI [paper] [code]
- The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark Towards Physically Realistic Embodied AI [paper]
- Housekeep: Tidying Virtual Households Using Commonsense Reasoning [paper] [code]
- TidyBot: personalized robot assistance with large language models [paper] [code]
- Building Cooperative Embodied Agents Modularly with Large Language Models [paper] [code]
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [paper] [code]
- Robotic Control via Embodied Chain-of-Thought Reasoning [paper] [code]
- Training Strategies for Efficient Embodied Reasoning [paper]
- Scaling up and distilling down: Language-guided robot skill acquisition [paper] [code]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [paper]
- Code as Policies: Language Model Programs for Embodied Control [paper] [code]
- ProgPrompt: program generation for situated robot task planning using large language models [paper] [code]
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper] [code]
- Alchemist: LLM-Aided End-User Development of Robot Applications [paper]
- RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [paper] [code]
- Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback [paper] [code]
- Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills [paper]
- Inner Monologue: Embodied Reasoning through Planning with Language Models [paper]
- REFLECT: Summarizing Robot Experiences for FaiLure Explanation and CorrecTion [paper] [code]
- HiCRISP: An LLM-based Hierarchical Closed-Loop Robotic Intelligent Self-Correction Planner [paper] [code]
- Autonomous Interactive Correction MLLM for Robust Robotic Manipulation [paper] [code]
- Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation [paper]
- Neuro-Symbolic Procedural Planning with Commonsense Prompting [paper] [code]
- Hierarchical Understanding in Robotic Manipulation: A Knowledge-Based Framework [paper]
- RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation [paper] [code]
- Robot Task Planning and Situation Handling in Open Worlds [paper] [code]
- Translating Natural Language to Planning Goals with Large-Language Models [paper]
- A framework for neurosymbolic robot action planning using large language models [paper] [code]
- Instruction-Augmented Long-Horizon Planning: Embedding Grounding Mechanisms in Embodied Mobile Manipulation [paper] [code]
- LEMMo-Plan: LLM-Enhanced Learning from Multi-Modal Demonstration for Planning Sequential Contact-Rich Manipulation Tasks [paper]
- Bootstrapping Object-Level Planning with Large Language Models [paper] [code]
- A survey of Behavior Trees in robotics and AI [paper]
- LLM-BT: Performing Robotic Adaptive Tasks based on Large Language Models and Behavior Trees [paper] [code]
- Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions [paper] [code]
- Automatic Behavior Tree Expansion with LLMs for Robotic Manipulation [paper] [code]
- LLM-as-BT-Planner: Leveraging LLMs for Behavior Tree Generation in Robot Task Planning [paper] [code]
- CLIPORT: What and Where Pathways for Robotic Manipulation [paper] [code]
- Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models [paper] [code]
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [paper] [code]
- Simple but Effective: CLIP Embeddings for Embodied AI [paper] [code]
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [paper] [code]
- Language Reward Modulation for Pretraining Reinforcement Learning [paper] [code]
- R3M: A Universal Visual Representation for Robot Manipulation [paper] [code]
- Open-World Object Manipulation using Pre-Trained Vision-Language Models [paper]
- Simple Open-Vocabulary Object Detection with Vision Transformers [paper] [code]
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [paper]
- Pretrained Language Models as Visual Planners for Human Assistance [paper] [code]
- Learning Universal Policies via Text-Guided Video Generation [paper]
- Learning to reason over scene graphs: a case study of finetuning GPT-2 into a robot language model for grounded task planning [paper]
- RoboPoint: A Vision-Language Model for Spatial Affordance Prediction in Robotics [paper] [code]
- Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models [paper] [code]
- PaLM-E: An Embodied Multimodal Language Model [paper]
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [paper] [code]
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [paper]
- DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics [paper]
- Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models [paper] [code]
- Semantically controllable augmentations for generalizable robot learning [paper]
- GR-MG: Leveraging Partially-Annotated Data via Multi-Modal Goal-Conditioned Policy [paper] [code]
- General Flow as Foundation Affordance for Scalable Robot Learning [paper] [code]
- EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos [paper] [code]
- H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation [paper]
- π0: A Vision-Language-Action Flow Model for General Robot Control [paper]
- RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [paper]
- Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation [paper] [code]
- SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model [paper] [code]
- PointVLA: Injecting the 3D World into Vision-Language-Action Models [paper]
- BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models [paper] [code]
- GeoVLA: Empowering 3D Representations in Vision-Language-Action Models [paper]
- VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation [paper]
- Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [paper]
- OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing [paper]
- Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [paper] [code]
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [paper] [code]
- LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks [paper]
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control [paper] [code]
- Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation [paper]
- MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation [paper]
- Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning [paper]
- ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [paper]
- Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better [paper]
- InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation [paper]
- GR-3 Technical Report [paper]
- Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation [paper] [code]
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper]
- WorldVLA: Towards Autoregressive Action World Model [paper] [code]
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [paper] [code]
- π0: A Vision-Language-Action Flow Model for General Robot Control [paper]
- FAST: Efficient Action Tokenization for Vision-Language-Action Models [paper] [code]
- π0.5: a Vision-Language-Action Model with Open-World Generalization [paper]
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies [paper]
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [paper] [code]
- ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models [paper] [code]
- ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy [paper]
- Interactive Post-Training for Vision-Language-Action Models [paper]
- Reinforcement Learning for Long-Horizon Interactive LLM Agents [paper]
| Optimization (Direction) | Article | Time | Observation | Action Generation | CoT | FP | MEM | MD | Pretraining CE | Scenarios MS | Scenarios RW | Execution CE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data Source Augmentation | EgoVLA | 2025-07 | RGB, ROB, TX | DP | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| | H-RDT | 2025-08 | RGB, ROB, TX | FM | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | Shortcut | 2025-08 | RGB, TX | - | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Spatial Understanding | SpatialVLA | 2025-01 | RGB, ROB, TX | AR | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | PointVLA | 2025-05 | RGBD, ROB, TX | DM | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | BridgeVLA | 2025-06 | RGB, ROB, TX | DP | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| | GeoVLA | 2025-08 | RGBD, ROB, TX | DM | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| Multimodal Sensing & Fusion | VTLA | 2025-05 | RGB, TX | AR | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| | Tactile-VLA | 2025-07 | RGB, ROB, TX | FM | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| | OmniVTLA | 2025-08 | RGB, ROB, TX | FM | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| | ForceVLA | 2025-09 | RGBD, ROB, TX | FM | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| | FuSe VLA | 2025-01 | RGBD, ROB, TX | AR | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| Long-Horizon Task Solving | LoHoVLA | 2025-05 | RGB, ROB, TX | AR | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| | Long-VLA | 2025-08 | RGB, TX | DM | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ |
| | DexVLA | 2025-08 | RGB, ROB, TX | DM | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | MemoryVLA | 2025-08 | RGB, TX | DM | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | DiffusionVLA | 2024-12 | RGBD, TX | DM + AR | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ |
| Knowledge Preserving | ChatVLA | 2025-02 | RGB, TX | DM | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ |
| | ChatVLA-2 | 2025-05 | RGB, TX | DM | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ |
| | Insulating | 2025-05 | RGB, ROB, TX | FM + AR | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | InstructVLA | 2025-07 | RGB, ROB, TX | FM | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| | GR-3 | 2025-07 | RGB, ROB, TX | FM | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |
| Reasoning & World Models | Seer | 2024-12 | RGB, ROB, TX | DP | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | CoT-VLA | 2025-05 | RGB, ROB, TX | AR | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | WorldVLA | 2025-06 | RGB, ROB, TX | AR | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| | DreamVLA | 2025-08 | RGB, ROB, TX | DM | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| | [ECoT VLA](https://openreview.net/forum?id=S70MgnIA0v) | 2024-07 | RGB, TX | AR | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ |
| Policy Execution | PI-0 | 2024-10 | RGB, ROB, TX | FM | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | PI-Fast | 2025-01 | RGB, ROB, TX | FM | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | PI-0.5 | 2025-04 | RGB, ROB, TX | FM | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | DisDiffVLA | 2025-08 | RGB, ROB, TX | DM | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Adaptation & Fine-Tuning | OpenVLA-OFT | 2025-02 | RGB, ROB, TX | DP | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| | ConRFT | 2025-04 | RGB, ROB, TX | DM | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| | RIPT-VLA | 2025-05 | RGB, ROB, TX | AR | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| | ControlVLA | 2025-06 | RGB, ROB, TX | DM | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
Notes
- Observation: RGB = RGB images; D = depth information (e.g., point clouds); ROB = robot proprioception; TX = text (e.g., prompt, language goal).
- Action generation: FM = Flow Matching; DM = Diffusion Model; AR = Autoregressive; DP = Direct Prediction; - = Not Applicable.
- CoT = Chain-of-Thought; FP = Future Prediction; MEM = Memory Mechanisms; MD = Multiple Datasets.
- Pretraining CE = Cross-embodiment Data; MS = Multi-scenario; RW = Real-world Deployment; Execution CE = Cross-embodiment Execution.
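If you want to filter or post-process the table above in a script, the abbreviations in these notes can be expanded mechanically. The helper below is only a sketch: the function names and the returned label strings are our own assumptions, and it covers just the Observation and Action Generation columns.

```python
from typing import List, Set

# Abbreviations for the "Action Generation" column, as listed in the notes.
ACTION_GENERATION = {
    "FM": "Flow Matching",
    "DM": "Diffusion Model",
    "AR": "Autoregressive",
    "DP": "Direct Prediction",
    "-": "Not Applicable",
}


def decode_observation(cell: str) -> Set[str]:
    """Expand an Observation cell such as 'RGBD, ROB, TX' into modality names."""
    modalities: Set[str] = set()
    for token in (part.strip() for part in cell.split(",")):
        if token in ("RGB", "RGBD"):
            modalities.add("RGB images")
            if token == "RGBD":
                modalities.add("depth")  # the trailing D marks depth information
        elif token == "ROB":
            modalities.add("robot proprioception")
        elif token == "TX":
            modalities.add("text")
        else:
            raise ValueError(f"unknown observation code: {token!r}")
    return modalities


def decode_action_generation(cell: str) -> List[str]:
    """Expand an Action Generation cell; 'DM + AR' denotes a combination."""
    return [ACTION_GENERATION[part.strip()] for part in cell.split("+")]


if __name__ == "__main__":
    print(sorted(decode_observation("RGBD, ROB, TX")))
    print(decode_action_generation("DM + AR"))
```

Unknown codes raise an error rather than being silently ignored, so a new abbreviation added to the table is noticed as soon as the script runs.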
| Benchmark | Simulation Engine or Real-world Dataset | Embodiment | Data Size | RGB | Depth | Masks | Tool used | Multi-agents | Long-horizon |
|---|---|---|---|---|---|---|---|---|---|
| CALVIN | PyBullet | Franka Panda | 2400k | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Meta-World | MuJoCo | Sawyer | - | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| RLBench | CoppeliaSim | Franka Panda | - | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| VIMAbench | PyBullet | UR5 | 650k | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
| LoHoRavens | PyBullet | UR5 | 15k | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| ARNOLD | NVIDIA Omniverse | Franka Panda | 10k | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| RoboGen | PyBullet | Multiple | - | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
| LIBERO | MuJoCo | Franka Panda | 6.5k | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Open X-Embodiment | Real-world Dataset | Multiple | 2419k | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| DROID | Real-world Dataset | Franka Panda | 76k | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Galaxea Open-world | Real-world Dataset | Galaxea R1 Lite | 100k | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Model | Year | Benchmark | Simulation Engine | Language Module | Perception Module | Real-world experiments | FMs | RL | IL | MP |
|---|---|---|---|---|---|---|---|---|---|---|
| IPRO | 2018 | # | - | LSTM | CNN | ✅ | ❌ | ❌ | ❌ | ✅ |
| MaestROB | 2018 | # | - | IBM Watson | ARToolKit | ✅ | ❌ | ❌ | ❌ | ✅ |
| exePlan | 2018 | # | - | CoreNLP | * | ✅ | ❌ | ❌ | ❌ | ✅ |
| TLC | 2018 | # | - | CCG | * | ✅ | ❌ | ❌ | ✅ | ❌ |
| Cut&recombine | 2019 | # | - | Parser | * | ✅ | ❌ | ❌ | ❌ | ✅ |
| DREAMCELL | 2019 | # | - | LSTM | * | ❌ | ❌ | ❌ | ✅ | ❌ |
| ICR | 2019 | # | - | Parser, DCG | YOLO9000 | ✅ | ❌ | ❌ | ❌ | ✅ |
| GroundedDA | 2019 | # | - | CCG | RANSAC | ✅ | ❌ | ❌ | ❌ | ✅ |
| MEC | 2020 | # | - | Parser, ADCG | Mask R-CNN | ✅ | ❌ | ❌ | ❌ | ✅ |
| LMCR | 2020 | # | - | RNN | Mask R-CNN | ✅ | ❌ | ❌ | ❌ | ✅ |
| PixL2R | 2020 | Meta-World | MuJoCo | LSTM | CNN | ❌ | ❌ | ✅ | ❌ | ❌ |
| Concept2Robot | 2020 | # | PyBullet | BERT | ResNet-18 | ❌ | ❌ | ❌ | ✅ | ❌ |
| LanguagePolicy | 2020 | # | CoppeliaSim | GloVe | Faster R-CNN | ❌ | ❌ | ❌ | ✅ | ❌ |
| LOReL | 2021 | Meta-World | MuJoCo | DistilBERT | CNN | ✅ | ❌ | ✅ | ❌ | ❌ |
| CARE | 2021 | Meta-World | MuJoCo | RoBERTa | * | ❌ | ✅ | ✅ | ❌ | ❌ |
| MCIL | 2021 | # | MuJoCo | MUSE | CNN | ❌ | ❌ | ❌ | ✅ | ❌ |
| BC-Z | 2021 | # | - | MUSE | ResNet-18 | ✅ | ❌ | ❌ | ✅ | ❌ |
| CLIPort | 2021 | # | PyBullet | CLIP | CLIP | ✅ | ❌ | ❌ | ✅ | ❌ |
| LanCon-Learn | 2022 | Meta-World | MuJoCo | GloVe | * | ❌ | ❌ | ✅ | ✅ | ❌ |
| MILLION | 2022 | Meta-World | MuJoCo | GloVe | * | ✅ | ❌ | ✅ | ❌ | ❌ |
| PaLM-SayCan | 2022 | # | - | PaLM | ViLD | ✅ | ✅ | ✅ | ✅ | ❌ |
| ATLA | 2022 | # | PyBullet | BERT-Tiny | CNN | ❌ | ✅ | ✅ | ❌ | ❌ |
| HULC | 2022 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | ❌ | ❌ | ❌ | ✅ | ❌ |
| PerAct | 2022 | RLBench | CoppeliaSim | CLIP | ViT | ✅ | ❌ | ❌ | ✅ | ❌ |
| RT-1 | 2022 | # | - | USE | EfficientNet-B3 | ✅ | ✅ | ❌ | ✅ | ❌ |
| LATTE | 2023 | # | CoppeliaSim | DistilBERT, CLIP | CLIP | ✅ | ❌ | ❌ | ❌ | ✅ |
| DIAL | 2022 | # | - | CLIP | CLIP | ✅ | ✅ | ❌ | ✅ | ❌ |
| R3M | 2022 | # | - | DistilBERT | ResNet-18, 34, 50 | ✅ | ❌ | ❌ | ✅ | ❌ |
| Inner Monologue | 2022 | # | - | CLIP | CLIP | ✅ | ✅ | ❌ | ❌ | ✅ |
| NLMap | 2023 | # | - | CLIP | ViLD | ✅ | ✅ | ❌ | ✅ | ❌ |
| Code as Policies | 2023 | # | - | GPT-3, Codex | ViLD | ✅ | ✅ | ❌ | ❌ | ✅ |
| ProgPrompt | 2023 | VirtualHome | Unity3D | GPT-3 | * | ✅ | ✅ | ❌ | ❌ | ✅ |
| Language2Reward | 2023 | # | MuJoCo MPC | GPT-4 | * | ✅ | ✅ | ✅ | ❌ | ❌ |
| LfS | 2023 | Meta-World | MuJoCo | Cons. Parser | * | ✅ | ❌ | ✅ | ❌ | ❌ |
| HULC++ | 2023 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | ✅ | ❌ | ❌ | ✅ | ❌ |
| ALOHA | 2023 | # | - | Transformer | CNN | ✅ | ❌ | ❌ | ✅ | ❌ |
| LEMMA | 2023 | LEMMA | NVIDIA Omniverse | CLIP | CLIP | ❌ | ❌ | ❌ | ✅ | ❌ |
| SPIL | 2023 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | ✅ | ❌ | ❌ | ✅ | ❌ |
| PaLM-E | 2023 | # | PyBullet | PaLM | ViT | ✅ | ✅ | ❌ | ✅ | ❌ |
| LAMP | 2023 | RLBench | CoppeliaSim | ChatGPT | R3M | ❌ | ✅ | ✅ | ❌ | ❌ |
| MOO | 2023 | # | - | OWL-ViT | OWL-ViT | ✅ | ❌ | ❌ | ✅ | ❌ |
| Instruct2Act | 2023 | VIMAbench | PyBullet | ChatGPT | CLIP | ❌ | ✅ | ❌ | ❌ | ✅ |
| VoxPoser | 2023 | # | SAPIEN | GPT-4 | OWL-ViT | ✅ | ✅ | ❌ | ❌ | ✅ |
| SuccessVQA | 2023 | # | IA Playroom | Flamingo | Flamingo | ✅ | ✅ | ❌ | ✅ | ❌ |
| VIMA | 2023 | VIMAbench | PyBullet | T5 | ViT | ✅ | ✅ | ❌ | ✅ | ❌ |
| TidyBot | 2023 | # | - | GPT-3 | CLIP | ✅ | ✅ | ❌ | ❌ | ✅ |
| Text2Motion | 2023 | # | - | GPT-3, Codex | * | ✅ | ✅ | ✅ | ❌ | ❌ |
| LLM-GROP | 2023 | # | Gazebo | GPT-3 | * | ✅ | ✅ | ❌ | ❌ | ✅ |
| Scaling Up | 2023 | # | MuJoCo | CLIP, GPT-3 | ResNet-18 | ✅ | ✅ | ❌ | ✅ | ❌ |
| Socratic Models | 2023 | # | - | RoBERTa, GPT-3 | CLIP | ✅ | ✅ | ❌ | ❌ | ✅ |
| SayPlan | 2023 | # | - | GPT-4 | * | ✅ | ✅ | ❌ | ❌ | ✅ |
| RT-2 | 2023 | # | - | PaLI-X, PaLM-E | PaLI-X, PaLM-E | ✅ | ✅ | ❌ | ✅ | ❌ |
| KNOWNO | 2023 | # | PyBullet | PaLM-2L | * | ✅ | ✅ | ❌ | ❌ | ✅ |
| MDT | 2023 | CALVIN | PyBullet | CLIP | CLIP | ❌ | ❌ | ❌ | ✅ | ❌ |
| RT-Trajectory | 2023 | # | - | PaLM-E | EfficientNet-B3 | ✅ | ✅ | ❌ | ✅ | ❌ |
| SuSIE | 2023 | CALVIN | PyBullet | InstructPix2Pix (GPT-3) | InstructPix2Pix | ✅ | ✅ | ❌ | ✅ | ❌ |
| PlayFusion | 2023 | CALVIN | PyBullet | Sentence-BERT | ResNet-18 | ✅ | ❌ | ❌ | ✅ | ❌ |
| ChainedDiffuser | 2023 | RLBench | CoppeliaSim | CLIP | CLIP | ✅ | ❌ | ❌ | ✅ | ❌ |
| GNFactor | 2023 | RLBench | CoppeliaSim | CLIP | NeRF | ✅ | ❌ | ❌ | ✅ | ❌ |
| StructDiffusion | 2023 | # | PyBullet | Sentence-BERT | PCT | ✅ | ❌ | ❌ | ❌ | ✅ |
| PoCo | 2024 | Fleet-Tools | Drake | T5 | ResNet-18 | ✅ | ❌ | ❌ | ✅ | ❌ |
| DNAct | 2024 | RLBench | CoppeliaSim | CLIP | NeRF, PointNext | ✅ | ❌ | ❌ | ✅ | ❌ |
| 3D Diffuser Actor | 2024 | CALVIN | PyBullet | CLIP | CLIP | ✅ | ❌ | ❌ | ✅ | ❌ |
| RoboFlamingo | 2024 | CALVIN | PyBullet | OpenFlamingo | OpenFlamingo | ❌ | ✅ | ❌ | ✅ | ❌ |
| OpenVLA | 2024 | Open X-Embodiment | - | Llama 2 7B | DINOv2 & SigLIP | ✅ | ✅ | ❌ | ✅ | ❌ |
| RT-X | 2024 | Open X-Embodiment | - | PaLI-X, PaLM-E | PaLI-X, PaLM-E | ✅ | ✅ | ❌ | ✅ | ❌ |
| PIVOT | 2024 | Open X-Embodiment | - | GPT-4, Gemini | GPT-4, Gemini | ✅ | ✅ | ❌ | ❌ | ✅ |
| RT-Hierarchy | 2024 | # | - | PaLI-X | PaLI-X | ✅ | ✅ | ❌ | ✅ | ❌ |
| 3D-VLA | 2024 | RLBench & CALVIN | CoppeliaSim & PyBullet | 3D-LLM | 3D-LLM | ❌ | ✅ | ❌ | ✅ | ❌ |
| Octo | 2024 | Open X-Embodiment | - | T5 | CNN | ✅ | ✅ | ❌ | ✅ | ❌ |
| ECoT | 2024 | BridgeData V2 | - | Llama 2 7B | DINOv2 & SigLIP | ✅ | ✅ | ❌ | ✅ | ❌ |
| LEGION | 2024 | Meta-World | MuJoCo | RoBERTa | * | ✅ | ❌ | ✅ | ❌ | ❌ |
| RACER | 2024 | RLBench | CoppeliaSim | Llama3-llava-next-8B | LLaVA | ✅ | ✅ | ❌ | ✅ | ❌ |
| Ground4Act | 2024 | # | Gazebo | Transformer | ResNet101, BERT | ✅ | ❌ | ✅ | ❌ | ❌ |
| LOVM | 2024 | # | - | BiGRU | LOVM | ❌ | ❌ | ✅ | ❌ | ✅ |
| ECLAIR | 2024 | # | - | GPT-3-turbo | * | ✅ | ✅ | ✅ | ❌ | ❌ |
| PR2L | 2024 | MineDojo | HM3D | InstructBLIP | InstructBLIP | ✅ | ✅ | ✅ | ❌ | ❌ |
| AHA | 2024 | RLBench, ManiSkill | CoppeliaSim, SAPIEN | LLaMA-2-13B | CLIP | ❌ | ✅ | ✅ | ❌ | ✅ |
| KOI | 2024 | Meta-World, LIBERO | MuJoCo | GPT-4v | KOI | ✅ | ✅ | ❌ | ✅ | ✅ |
| GPT-4V(ISION) | 2024 | # | - | GPT-4 | GPT-4 | ✅ | ✅ | ❌ | ✅ | ❌ |
| HiRT | 2024 | Meta-World, Franka-Kitchen | MuJoCo | InstructBLIP | CNN | ✅ | ✅ | ❌ | ✅ | ❌ |
| Sentinel | 2024 | # | - | GPT-4o | PointNet++ | ✅ | ✅ | ❌ | ❌ | ✅ |
| RoLD | 2024 | Open X-Embodiment, Robomimic, Meta-World | -, MuJoCo | DistilBERT | DistilBERT | ❌ | ❌ | ❌ | ❌ | ✅ |
| ITS | 2025 | * | - | LLaMA | A2C | ❌ | ✅ | ✅ | ❌ | ✅ |
| SIAMS | 2025 | Miniworld | Pyglet | LTL | CNN | ❌ | ✅ | ✅ | ❌ | ❌ |
| CRTO | 2025 | Continual World | MuJoCo | ChatGPT | * | ❌ | ✅ | ✅ | ❌ | ✅ |
| LAMARL | 2025 | # | - | OpenAI | MADDPG | ✅ | ✅ | ✅ | ❌ | ✅ |
| ARCHIE | 2025 | # | - | GPT-4 | * | ✅ | ✅ | ✅ | ❌ | ✅ |
| RealBEF | 2025 | Meta-World | MuJoCo | ALBEF | CNN | ❌ | ✅ | ✅ | ❌ | ❌ |
| LLMRewardShaping | 2025 | Meta-World | MuJoCo | GPT-4 | * | ✅ | ✅ | ✅ | ❌ | ❌ |
| BOSS | 2025 | LIBERO | MuJoCo | OpenVLA | ResNet | ❌ | ✅ | ❌ | ✅ | ✅ |
| LAV-ACT | 2025 | # | MuJoCo | Voltron | Voltron | ✅ | ❌ | ❌ | ✅ | ❌ |
| TPM | 2025 | # | MuJoCo | GPT-4 | ResNet | ✅ | ✅ | ❌ | ✅ | ✅ |
| Mamba | 2025 | # | - | Mamba | Mamba | ✅ | ✅ | ❌ | ✅ | ✅ |
| TransformerPolicy | 2025 | CALVIN | PyBullet | Transformer | Sentence-BERT | ✅ | ❌ | ❌ | ✅ | ✅ |
| HierarchicalLCL | 2025 | CALVIN | PyBullet | OpenFlamingoM-3B | ViT | ❌ | ✅ | ❌ | ✅ | ✅ |
| BLADE | 2025 | CALVIN | PyBullet | GPT-4 | PCT | ✅ | ✅ | ❌ | ✅ | ✅ |
| LES6DPose | 2025 | # | Isaac Gym | GPT-4 | PointNet++ | ✅ | ✅ | ❌ | ❌ | |
| SafetyFilter | 2025 | # | - | GPT-4o | CLIP | ✅ | ✅ | ❌ | ❌ | ✅ |
| TARAD | 2025 | RLBench | CoppeliaSim | GPT-4o | CLIP | ✅ | ✅ | ❌ | ✅ | ✅ |
| DISCO | 2025 | CALVIN | PyBullet | GPT-4o | * | ✅ | ✅ | ❌ | ❌ | ✅ |
| TinyVLA | 2025 | Meta-World | MuJoCo | Pythia | MLP | ✅ | ✅ | ❌ | ❌ | ✅ |
| ASD-QR | 2025 | ScalingUp | MuJoCo | GPT-3 | CLIP | ❌ | ✅ | ✅ | ❌ | ✅ |
| RDT-1B | 2025 | # | - | GPT-4-Turbo | T5-XXL | ✅ | ✅ | ❌ | ✅ | ❌ |
| GRAVMAD | 2025 | RLBench | CoppeliaSim | GPT-4o | CLIP | ✅ | ✅ | ❌ | ✅ | ✅ |
| GR-MG | 2025 | CALVIN | PyBullet | Transformer | T5-Base | ✅ | ❌ | ❌ | ❌ | ✅ |
| LEMMo-Plan | 2025 | # | - | GPT-4o | * | ✅ | ✅ | ❌ | ❌ | ✅ |
If you find this survey or repository useful, please consider citing:
@article{zhou2023language,
author = {Hongkuan Zhou and
Xiangtong Yao and
Oier Mees and
Yuan Meng and
Ted Xiao and
Yonatan Bisk and
Jean Oh and
Edward Johns and
Mohit Shridhar and
Dhruv Shah and
Jesse Thomason and
Kai Huang and
Joyce Chai and
Zhenshan Bing and
Alois Knoll},
title = {Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation},
journal = {CoRR},
volume = {abs/2312.10807},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2312.10807}
}