Chenyang Liu · Jiafan Zhang · Keyan Chen · Man Wang · Zhengxia Zou · Zhenwei Shi*✉
This repo records and tracks recent Remote Sensing Spatio-Temporal Vision-Language Models (RS-STVLMs). If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request.
Give us a ⭐ if you're interested in this repo. We will continue to track relevant progress and update this repository.
- You are welcome to open an issue or PR for your RS-STVLM work! We will record it for the next version of our survey.
🔥🔥🔥 The repo is being updated 🔥🔥🔥
✅ The first survey on Remote Sensing Spatio-Temporal Vision-Language Models.
✅ Some public datasets and code links are provided.
✅ We will continue to track related work in this repository.
Timeline of RS-STVLMs:
- 📚 Remote Sensing Spatio-Temporal Vision-Language Tasks and Methods
- 👨🏫 Large Language Models Meet Temporal Images
- 🛰️ Dataset
- 💻 Others
- 🖊️ Citation
- 🐲 Contact
| ........
| Time | Model Name | Paper Title | Code/Project |
|---|---|---|---|
| 2024.06 | ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | |
| 2025.01 | text-ITSR | Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing | N/A |
| ........ |
| Time | Model Name | Grounding Output | Paper Title | Code/Project |
|---|---|---|---|---|
| 2024.09 | ChangeChat | mask | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | |
| 2024.10 | TEOChat | bbox | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | |
| 2024.10 | VisTA | mask | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | |
| 2024.12 | RSUniVLM | mask | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | |
| 2024.12 | EarthDial | bbox | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | |
| 2025.03 | Falcon | mask | Falcon: A Remote Sensing Vision-Language Foundation Model | |
| 2025.03 | GeoRSMLLM | mask | GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | N/A |
| ........ |
| Time | Model Name | Paper Title | Code/Project |
|---|---|---|---|
| 2025.02 | TGIPG | Image Editing based on Diffusion Model for Remote Sensing Image Change Captioning | N/A |
| 2025.04 | ChangeDiff | ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model | |
| 2025.07 | -- | Open-vocabulary generative vision-language models for creating a large-scale remote sensing change detection dataset | Link |
| 2025.07 | ChangeBridge | ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing | N/A |
| ........ |
| Time | Method | Paper Title | Function | Code |
|---|---|---|---|---|
| 2024.01 | RSChatgpt | Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models | Single-image analysis | |
| 2024.03 | Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | Spatio-Temporal Change Interpretation | |
| 2024.06 | RS-Agent | RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Tool selection and knowledge search | Link |
| 2024.07 | RS-AGENT | RS-AGENT: Large Language Models Guided Agent System for Remote Sensing Image Generation | Image Generation | N/A |
| 2024.12 | GeoTool-GPT | GeoTool-GPT: a trainable method for facilitating Large Language Models to master GIS tools | Master GIS tools | N/A |
| 2025.01 | RescueADI | RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images With Autonomous Agents | Disaster Interpretation | N/A |
| ........ |
| Dataset | Time | Image Size | Image Resolution | Image Pairs | Captions* | Masks | Temporal Image Data Source | Anno. | Link |
|---|---|---|---|---|---|---|---|---|---|
| DUBAI CCD | 2022.08 | 50×50 | 30m | 500 | 2,500 | - | Landsat-7 imagery | Manual | Link |
| LEVIR CCD | 2022.08 | 256×256 | 0.5m | 500 | 2,500 | - | LEVIR-CD | Manual | Link |
| LEVIR-CC | 2022.11 | 256×256 | 0.5m | 10,077 | 50,385 | - | LEVIR-CD | Manual | Link |
| CCExpert | 2024.11 | - | - | 200K | 1.2M | - | LEVIR-CC, CLEVR-Change, ImageEdit, Spot-the-Diff, STVchrono, Vismin, ChangeSim, SYSU-CD, SECOND | Auto. | Link |
| SECTION | 2025.07 | 256×256 | 0.3~3m | 4,059 | 12,200 | - | SECOND | Manual | Link |
| LEVIR-MCI | 2024.03 | 256×256 | 0.5m | 10,077 | 50,385 | building, road | LEVIR-CC | Manual | Link |
| LEVIR-CDC | 2024.11 | 256×256 | 0.5m | 10,077 | 50,385 | building | LEVIR-CC | Manual | Link |
| WHU-CDC | 2024.11 | 256×256 | 0.075m | 7,434 | 37,170 | building | WHU-CD | Manual | Link |
| SECOND-CC | 2025.01 | 256×256 | 0.3~3m | 6,041 | 30,205 | 6 classes | SECOND | Manual | Link |
| Dataset | Time | Instruction Samples | Number of Images | Temporal Length | Temporal Image Data Source | Anno. | Link |
|---|---|---|---|---|---|---|---|
| CDVQA | 2022.09 | 122,000 | 2,968 | 2 | SECOND | Manual | Link |
| ChangeChat-87k | 2024.09 | 87,195 | 10,077 | 2 | LEVIR-CC, LEVIR-MCI | Auto. | Link |
| QAG-360K | 2024.10 | 360,000 | 6,810 | 2 | Hi-UCD, SECOND, LEVIR-CD | Auto. | Link |
| GeoLLaVA | 2024.10 | 100,000 | 100,000 | 2 | fMoW | Auto. | Link |
| TEOChatlas | 2024.10 | 554,071 | - | 1~8 | xBD, S2Looking, QFabric, fMoW | Auto. | Link |
| EarthDial | 2024.12 | 11.11M | - | 1~4 | fMoW, TreeSatAI-Time-Series, MUDS, xBD, QuakeSet | Manual & Auto. | Link |
| UniRS | 2024.12 | 318.8K | - | 1~T (T>2) | LEVIR-CC, ERA-Video | Auto. | Link |
| Falcon_SFT | 2025.03 | 78M | 5.6M | 1~2 | CDD, EGY-BCD, HRSCD, LEVIR-CD, MSBC, MSOSCD, NJDS, S2Looking, SYSU-CD, WHU-CD | Auto. | Link |
| DVL-Suite | 2025.05 | 69,926 | 15,063 | 6.9 (Average) | U.S. National Agriculture Imagery Program (NAIP) | Manual & Auto. | N/A |
| .... |
| Time | Model Name | Paper Title | Code/Project |
|---|---|---|---|
| 2023.06 | RemoteCLIP | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Link |
| 2023.06 | GeoRSCLIP | RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing | Link |
| 2023.12 | SkyCLIP | SkyScript: a large and semantically diverse vision-language dataset for remote sensing | Link |
| 2025.01 | Git-RSCLIP | Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model | Link |
If you find our survey and repository useful for your research, please consider citing our paper:
@ARTICLE{liu2024RSSTVLMsurvey,
author={Liu, Chenyang and Zhang, Jiafan and Chen, Keyan and Wang, Man and Zou, Zhengxia and Shi, Zhenwei},
journal={IEEE Geoscience and Remote Sensing Magazine},
title={Remote Sensing Spatiotemporal Vision–Language Models: A comprehensive survey},
year={2025},
volume={},
number={},
pages={2-42},
doi={10.1109/MGRS.2025.3598283}}
Contact: liuchenyang@buaa.edu.cn
