This repository accompanies our survey paper:
The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning
Sheila Schoepp, Masoud Jafaripour*, Yingyue Cao*, Tianpei Yang, Fatemeh Abdollahi, Shadan Golestan, Zahin Sufiyan, Osmar R. Zaiane, Matthew E. Taylor (*equal contribution)
University of Alberta, Nanjing University, Alberta Machine Intelligence Institute (Amii)
📄 [arXiv Paper](https://arxiv.org/abs/2502.15214)
📌 This work provides a systematic taxonomy and analysis of how large language models (LLMs) and vision-language models (VLMs) enhance reinforcement learning (RL) tasks.
We categorize the integration of LLMs/VLMs into RL into three core roles:
- LLM/VLM as Agent: the model acts as a parametric or non-parametric decision-maker.
- LLM/VLM as Planner: the model generates comprehensive or incremental plans.
- LLM/VLM as Reward: the model defines or generates a reward function or model.
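The snippet below is a minimal, illustrative sketch of what each role can look like in code. It assumes a generic text-completion callable `llm(prompt) -> str` and plain-text observations; none of the function names come from a specific surveyed method.

```python
# Hypothetical helpers illustrating the three roles; `llm(prompt) -> str` is an
# assumed generic text-completion callable, not an API from any surveyed paper.

def llm_as_agent(llm, observation: str, actions: list[str]) -> str:
    """Agent role: the model itself picks the next action."""
    choice = llm(f"Observation: {observation}\nChoose one action from {actions}:").strip()
    return choice if choice in actions else actions[0]   # fall back to a valid action

def llm_as_planner(llm, task: str) -> list[str]:
    """Planner role: the model decomposes the task into subgoals for a low-level policy."""
    plan = llm(f"Task: {task}\nList the subgoals, one per line:")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def llm_as_reward(llm, transition: str) -> float:
    """Reward role: the model scores a transition (other works instead write reward code)."""
    score = llm(f"On a scale of 0 to 1, how much does this transition help the task?\n{transition}")
    try:
        return float(score.strip())
    except ValueError:
        return 0.0   # unparsable answers default to zero reward
```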
We also discuss:
- Challenges: sample inefficiency, reward engineering, poor generalization, etc.
- Future Directions: grounding, bias mitigation, action advice, multimodal representation.
Highlights:
- Unified taxonomy: three key roles for foundation models (FMs) in RL, namely Agent, Planner, and Reward
- Comprehensive benchmarks: 40+ recent methods categorized and compared
- Multimodal perspective: includes both LLM and VLM applications across domains
- Open challenges: in-depth discussion of limitations and paths forward
Benchmarks and environments:
| Dataset | Year | Domain | Modality | Link |
|---|---|---|---|---|
| MineDojo | 2022 | Minecraft | Vision+Text | GitHub |
| SayCan | 2022 | Robotics | Language+Action | GitHub |
| RL4VLM | 2024 | Multi-task RL | Vision+Language | GitHub |
(See paper for full list.)
LLM/VLM as Agent (parametric, fine-tuned): AGILE, TWOSOME, POAD, Retroformer, Zhai et al. A minimal action-scoring sketch follows the table.
| Method | Model(s) | FT | Metrics | Code |
|---|---|---|---|---|
| AGILE [Feng et al., 2024] | Meerkat, Vicuna-1.5 | ✓* | acc, rew | Link |
| Retroformer [Yao et al., 2024] | GPT-3, GPT-4, LongChat | ✓ | sr, se | - |
| TWOSOME [Tan et al., 2024] | Llama | ✓* | sr, rew, gen, se | Link |
| POAD [Wen et al., 2024] | CodeLlama, Llama 2 | ✓* | rew, gen, se | Link |
| GLAM [Carta et al., 2023] | FLAN-T5 | ✓ | se, gen | Link |
| Zhai et al. [Zhai et al., 2024] | LLaVA-v1.6-Mistral | ✓* | sr | Link |
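As a rough illustration of how parametric agents such as TWOSOME and GLAM turn an LLM into a policy, the hedged sketch below scores each candidate text action by the token log-probabilities the model assigns to it. The checkpoint name is only a placeholder, tokenization boundary effects are ignored, and the actual RL fine-tuning step (e.g., PPO on these action probabilities) is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small checkpoint; any Hugging Face causal LM works for this sketch.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def action_log_prob(prompt: str, action: str) -> float:
    """Sum of the token log-probabilities the LM assigns to `action` given `prompt`."""
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + action, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                     # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..end
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, n_prompt - 1:].sum().item()          # keep only the action tokens

obs = "You are in the kitchen. The kettle is cold.\nNext action:"
actions = [" turn on the stove", " open the fridge"]
scores = torch.tensor([action_log_prob(obs, a) for a in actions])
policy = torch.softmax(scores, dim=0)  # distribution over text actions; an RL update
                                       # would fine-tune the LM against this policy
```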
LLM/VLM as Agent (non-parametric, no fine-tuning): Reflexion, ExpeL, ICPI, RLingua, REMEMBERER. A prompting-with-reflection sketch follows the table.
| Method | Model(s) | FT | Metrics | Code |
|---|---|---|---|---|
| ICPI [Brooks et al., 2023] | Codex | × | rew, gen | Link |
| Reflexion [Shinn et al., 2023] | GPT-3, GPT-3.5-Turbo, GPT-4 | × | sr, acc | Link |
| REMEMBERER [Zhang et al., 2023a] | GPT-3.5 | × | sr, rob | Link |
| ExpeL [Zhao et al., 2024] | GPT-3.5-Turbo, GPT-4 | × | sr, gen | Link |
| RLingua [Chen et al., 2024] | GPT-4 | × | sr, se | - |
| Xu et al. [Xu et al., 2024] | GPT-3.5-Turbo | × | sr, rob | - |
| LangGround [Li et al., 2024] | GPT-4 | × | sr, gen, se, int | Link |
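In contrast, non-parametric agents keep the model frozen and improve through the prompt. The sketch below follows the general pattern of Reflexion: failed trajectories are turned into verbal self-reflections that are prepended to later attempts. `llm` and `env` are hypothetical stand-ins (an old-style Gym step API is assumed), not interfaces from the paper.

```python
# Reflexion-style loop (sketch): the LLM weights stay frozen; "learning" happens by
# storing verbal self-reflections and prepending them to later prompts.

def run_episode(env, llm, reflections: list[str]) -> tuple[bool, str]:
    memory = "\n".join(reflections)
    obs, trajectory, done, success = env.reset(), [], False, False
    while not done:
        action = llm(f"Past lessons:\n{memory}\n\nObservation: {obs}\nAction:")
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        success = info.get("success", False)
    return success, str(trajectory)

def reflexion(env, llm, max_trials: int = 3) -> bool:
    reflections: list[str] = []
    for _ in range(max_trials):
        success, trajectory = run_episode(env, llm, reflections)
        if success:
            return True
        # Ask the model to critique its own failed trajectory and store the lesson.
        reflections.append(llm(f"The attempt failed:\n{trajectory}\nWhat should be done differently?"))
    return False
```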
LLM/VLM as Planner (comprehensive planners): SayTap, PSL, LMA3, Inner Monologue. A plan-then-execute sketch follows the table.
| Method | Model(s) | FT | Metrics | Code |
|---|---|---|---|---|
| SayTap [Tang et al., 2023] | GPT-4 | × | sr, acc | - |
| LgTS [Shukla et al., 2024] | Llama 2 | × | sr, se | - |
| PSL [Dalal et al., 2024] | GPT-4 | × | sr, gen, se | Link |
| LLaRP [Szot et al., 2024] | Llama | × | sr, gen, rob, se | Link |
| LMA3 [Colas et al., 2023] | GPT-3.5-Turbo | × | gen, exp | - |
| When2Ask [Hu et al., 2024] | Vicuna | × | sr | - |
| Inner Monologue [Huang et al., 2022] | GPT-3, PaLM | × | sr, rob, al | - |
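A comprehensive planner produces the plan up front and then hands the subgoals to a low-level controller. The sketch below is a generic illustration of that pattern only; `llm` and `low_level_policy` are hypothetical stand-ins and do not correspond to any specific method above.

```python
# Comprehensive-planner sketch: generate the whole plan once, then execute it.

def plan_then_execute(llm, env, low_level_policy, task: str) -> bool:
    plan = [s.strip()
            for s in llm(f"Task: {task}\nList the subgoals, one per line:").splitlines()
            if s.strip()]
    obs = env.reset()
    for subgoal in plan:                       # the plan is fixed; no replanning
        obs, achieved = low_level_policy(env, obs, subgoal)
        if not achieved:
            return False                       # a failed subgoal fails the whole plan
    return True
```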
LLM/VLM as Planner (incremental planners): SayCan, BOSS, AdaRefiner, LLM4Teach. A step-by-step skill-selection sketch follows the table.
| Method | Model(s) | FT | Metrics | Code |
|---|---|---|---|---|
| SayCan [Ichter et al., 2022] | PaLM | × | sr, rob | Link |
| LLM4Teach [Zhou et al., 2024] | ChatGLM-Turbo, Vicuna | × | sr, se | Link |
| AdaRefiner [Zhang and Lu, 2024] | Llama 2, GPT-4 | ✓* | sr, rew, gen, exp | Link |
| BOSS [Zhang et al., 2023b] | Llama | × | sr, gen, rob, se | - |
| Text2Motion [Lin et al., 2023] | Codex, GPT-3.5 | × | sr, gen, int | - |
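Incremental planners instead choose the next step one at a time. The sketch below loosely follows the SayCan idea of multiplying the model's task-relevance score for a skill by a learned affordance/value estimate; `llm_score`, `value_fn`, `execute`, and the skill list (including a terminating "done" skill) are hypothetical stand-ins.

```python
# Incremental-planner sketch in the spirit of SayCan: at every step the next skill is
# chosen by combining the LLM's relevance score with a learned affordance estimate.

def choose_next_skill(llm_score, value_fn, task, history, obs, skills):
    def combined(skill: str) -> float:
        relevance = llm_score(task, history, skill)   # "does this skill help the task?"
        affordance = value_fn(obs, skill)             # "can this skill succeed right now?"
        return relevance * affordance
    return max(skills, key=combined)

def incremental_plan(llm_score, value_fn, execute, env, task, skills, max_steps: int = 20):
    history: list[str] = []
    obs = env.reset()
    for _ in range(max_steps):                        # replan after every executed skill
        skill = choose_next_skill(llm_score, value_fn, task, history, obs, skills)
        if skill == "done":
            break
        obs = execute(env, skill)
        history.append(skill)
```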
LLM/VLM as Reward (reward function): Text2Reward, Eureka, Zeng et al. A reward-code-generation sketch follows the table.
| Method | Model(s) | FT | Metrics | Code |
|---|---|---|---|---|
| Text2Reward [Xie et al., 2024] | GPT-4 | × | sr, se, al | Link |
| Zeng et al. [2024] | GPT-4 | × | sr, se | - |
| Eureka [Ma et al., 2024] | GPT-4 | × | sr, gen, se, al | Link |
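Methods in this group (e.g., Text2Reward and Eureka) prompt the model to write an executable reward function from a task description. The sketch below shows only that generation-and-compilation step under a hypothetical `llm` callable; validation, sandboxing, and the iterative refinement these papers perform are omitted.

```python
# Reward-function sketch: the LLM writes reward code, which is compiled and then
# used by any standard RL algorithm. `llm(prompt) -> str` is a hypothetical callable.

def generate_reward_fn(llm, task: str, obs_description: str):
    prompt = (
        "Write a Python function `reward(obs, action) -> float` for this task.\n"
        f"Task: {task}\nObservation fields: {obs_description}\n"
        "Return only code."
    )
    code = llm(prompt)
    namespace: dict = {}
    exec(code, namespace)           # trusted-code assumption; real systems sandbox this
    return namespace["reward"]

# Usage (illustrative): reward_fn = generate_reward_fn(llm, "keep the pole upright",
#                                                      "obs = [x, x_dot, theta, theta_dot]")
# then call reward_fn(obs, action) inside the environment loop during training.
```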
LLM/VLM as Reward (reward model): VLM-RM, MineCLIP, RL-VLM-F. A CLIP-similarity reward sketch follows the table.
| Method | Model(s) | FT | Metrics | Code |
|---|---|---|---|---|
| Kwon et al. [2023] | GPT-3 | × | acc, se, al | - |
| PREDILECT [Holk et al., 2024] | GPT-4 | × | rew, se, al | - |
| ELLM [Du et al., 2023] | Codex, GPT-3 | × | sr, gen, se, exp | - |
| RL-VLM-F [Wang et al., 2024] | Gemini-Pro, GPT-4V | × | sr, rew, se | Link |
| VLM-RM [Rocamonde et al., 2024] | CLIP | × | sr, al | Link |
| MineCLIP [Fan et al., 2022] | CLIP | ✓* | sr, gen, se, al | Link |
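Reward-model approaches instead score observations directly. The sketch below, in the spirit of VLM-RM, uses the cosine similarity between a CLIP image embedding of the current frame and a text description of the goal as a dense reward; the checkpoint name is illustrative and the Hugging Face CLIP API is assumed.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-style vision-language model would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(frame: Image.Image, goal: str) -> float:
    """Cosine similarity between the current frame and the text goal, used as reward."""
    inputs = processor(text=[goal], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```

In a training loop, a call such as `clip_reward(frame, "a robot standing upright")` would replace or supplement the environment reward at each step.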
- Grounding: Bridging high-level plans with low-level controllers
- Bias Mitigation: Debiasing pretrained FMs for RL safety
- Multimodal Representation: Richer integration of language, vision, and control
- Action Advice: Using FMs as virtual oracles to guide agents
If you find this work helpful, please consider citing our survey:
@article{schoepp2025llmrlsurvey,
  title={The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning},
  author={Schoepp, Sheila and Jafaripour, Masoud and Cao, Yingyue and Yang, Tianpei and Abdollahi, Fatemeh and Golestan, Shadan and Sufiyan, Zahin and Zaiane, Osmar R. and Taylor, Matthew E.},
  journal={arXiv preprint arXiv:2502.15214},
  year={2025}
}

We welcome pull requests to add missing papers, implementations, or benchmarks!
How to contribute:
- Fork the repository
- Add your paper or code to the relevant section in README.md
- Use the format: `| [Title](Paper Link) | Category | [Code](Code Link) |`
- Open a pull request
Last updated: 2025/06/07