A curated list of Visual Language Models papers and resources for Earth Observation (VLM4EO)

This list is created and maintained by Ali Koteich and Hasan Moughnieh from the GEOspatial Artificial Intelligence (GEOAI) research group at the National Center for Remote Sensing - CNRS, Lebanon.

We encourage you to contribute to this project according to the following guidelines.

---

If you find this repository useful, please consider giving it a ⭐

Table Of Contents

Foundation Models

Year Title Paper Code Venue
2024 EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain paper
2024 RemoteCLIP: A Vision Language Foundation Model for Remote Sensing paper code
2024 Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models paper code
2024 SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model paper code
2023 GeoChat: Grounded Large Vision-Language Model for Remote Sensing paper code
2023 Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment paper

Image Captioning

Year Title Paper Code Venue
2024 A Lightweight Transformer for Remote Sensing Image Change Captioning paper code
2024 RSCaMa: Remote Sensing Image Change Captioning with State Space Model paper code
2023 Captioning Remote Sensing Images Using Transformer Architecture paper International Conference on Artificial Intelligence in Information and Communication
2023 Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning paper MDPI Remote Sensing
2023 Progressive Scale-aware Network for Remote sensing Image Change Captioning paper
2023 Towards Unsupervised Remote Sensing Image Captioning and Retrieval with Pre-Trained Language Models paper Proceedings of the Japanese Association for Natural Language Processing
2022 A Joint-Training Two-Stage Method for Remote Sensing Image Captioning paper IEEE TGRS
2022 A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning paper MDPI Remote Sensing
2022 Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis paper IEEE TGRS
2022 Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning paper code IEEE GRSL
2022 Generating the captions for remote sensing images: A spatial-channel attention based memory-guided transformer approach paper code Engineering Applications of Artificial Intelligence
2022 Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning paper IEEE TGRS
2022 High-Resolution Remote Sensing Image Captioning Based on Structured Attention paper IEEE TGRS
2022 Meta captioning: A meta learning based remote sensing image captioning framework paper code Elsevier PHOTO
2022 Multiscale Multiinteraction Network for Remote Sensing Image Captioning paper IEEE JSTARS
2022 NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning paper code IEEE TGRS
2022 Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning paper IEEE TGRS
2022 Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset paper IEEE TGRS
2022 Transforming remote sensing images to textual descriptions paper Int J Appl Earth Obs Geoinf
2022 Using Neural Encoder-Decoder Models with Continuous Outputs for Remote Sensing Image Captioning paper IEEE Access
2021 A Novel SVM-Based Decoder for Remote Sensing Image Captioning paper IEEE TGRS
2021 SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning paper code IEEE TGRS
2021 Truncation Cross Entropy Loss for Remote Sensing Image Captioning paper IEEE TGRS
2021 Word-Sentence Framework for Remote Sensing Image Captioning paper IEEE TGRS
2020 A multi-level attention model for remote sensing image captions paper MDPI Remote Sensing
2020 Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning paper Elsevier Knowledge-Based Systems
2020 Toward Remote Sensing Image Retrieval Under a Deep Image Captioning Perspective paper IEEE JSTARS
2019 LAM: Remote sensing image captioning with attention-based language model paper IEEE TGRS
2019 Learning to Caption Remote Sensing Images by Geospatial Feature Driven Attention Mechanism paper IEEE JSTARS
2019 Remote Sensing Image Captioning by Deep Reinforcement Learning with Geospatial Features paper IEEE TGRS

Text-Image Retrieval

Year Title Paper Code Venue
2024 Composed Image Retrieval for Remote Sensing paper code
2024 Multi-Spectral Remote Sensing Image Retrieval using Geospatial Foundation Models paper code
2024 Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval paper code
2023 A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval paper code ACM MM 2023 (Oral)
2023 A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing paper MDPI Remote Sensing
2023 An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval paper MDPI Mathematics
2023 Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval paper MDPI Applied Sciences
2023 Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning paper code IEEE TGRS
2023 Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval paper IEEE TGRS
2023 Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval paper code ICMR'23
2022 A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing paper code IEEE TGRS
2022 An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing paper code IEEE ICIP
2022 CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study paper Virginia Polytechnic Institute and State University
2022 Knowledge-Aware Cross-Modal Text-Image Retrieval for Remote Sensing Images paper
2022 MCRN: A Multi-source Cross-modal Retrieval Network for remote sensing paper code Int J Appl Earth Obs Geoinf
2022 Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval paper IEEE JSTARS
2022 Multisource Data Reconstruction-Based Deep Unsupervised Hashing for Unisource Remote Sensing Image Retrieval paper code IEEE TGRS
2022 Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information paper code IEEE TGRS
2022 Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing paper code IEEE ICASSP
2021 Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval paper code IEEE TGRS
2020 Deep unsupervised embedding for remote sensing image retrieval using textual cues paper MDPI Applied Sciences
2020 TextRS: Deep bidirectional triplet network for matching text to remote sensing images paper MDPI Remote Sensing
2020 Toward Remote Sensing Image Retrieval under a Deep Image Captioning Perspective paper IEEE JSTARS

Visual Grounding

Year Title Paper Code Venue
2023 LaLGA: Multi-Scale Language-Aware Visual Grounding on Remote Sensing Data paper code
2023 Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models paper code
2022 RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data paper code IEEE TGRS
2022 Visual Grounding in Remote Sensing Images paper ACM MM

Visual Question Answering

Year Title Paper Code Venue
2023 A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering paper IEEE TGRS
2023 EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering paper code AAAI 2024
2023 LIT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing paper code IEEE IGARSS
2023 Multistep Question-Driven Visual Question Answering for Remote Sensing paper code IEEE TGRS
2023 RSGPT: A Remote Sensing Vision Language Model and Benchmark paper code
2023 RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering paper code
2022 Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery paper IEEE TGRS
2022 Change Detection Meets Visual Question Answering paper code IEEE TGRS
2022 From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data paper code IEEE TGRS
2022 Language Transformers for Remote Sensing Visual Question Answering paper IEEE IGARSS
2022 Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing paper code SPIE Image and Signal Processing for Remote Sensing
2022 Mutual Attention Inception Network for Remote Sensing Visual Question Answering paper code IEEE TGRS
2022 Prompt-RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question Answering paper CVPRW
2021 How to find a good image-text embedding for remote sensing visual question answering? paper CEUR Workshop Proceedings
2021 Mutual Attention Inception Network for Remote Sensing Visual Question Answering paper code IEEE TGRS
2021 RSVQA meets BigEarthNet: a new, large-scale, visual question answering dataset for remote sensing paper code IEEE IGARSS
2020 RSVQA: Visual Question Answering for Remote Sensing Data paper code IEEE TGRS

Vision-Language Remote Sensing Datasets

Name Link Paper Link Description
RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model Link Paper Link Size: 5 million remote sensing images with English descriptions
Resolution : 256 x 256
Platforms: 11 publicly available image-text paired datasets
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing Link paper Link Size : 5.2 million remote sensing image-text pairs in total, covering more than 29K distinct semantic tags
Remote Sensing Visual Question Answering Low Resolution Dataset (RSVQA LR) Link Paper Link Size: 772 images & 77,232 questions and answers
Resolution : 256 x 256
Platforms: Sentinel-2 and Open Street Map
Use: Remote Sensing Visual Question Answering
Remote Sensing Visual Question Answering High Resolution Dataset (RSVQA HR) Link Paper Link Size: 10,659 images & 955,664 questions and answers
Resolution : 512 x 512
Platforms: USGS and Open Street Map
Use: Remote Sensing Visual Question Answering
Remote Sensing Visual Question Answering BigEarthNet Dataset (RSVQA x BEN) Link Paper Link Size: 140,758,150 image/question/answer triplets
Resolution : High-resolution (15cm)
Platforms: Sentinel-2, BigEarthNet and Open Street Map
Use: Remote Sensing Visual Question Answering
Remote Sensing Image Visual Question Answering (RSIVQA) Link Paper Link Size: 37,264 images and 111,134 image-question-answer triplets
A small part of RSIVQA is annotated by humans. The rest is automatically generated from existing scene classification and object detection datasets
Use: Remote Sensing Visual Question Answering
FloodNet Visual Question Answering Dataset Link Paper Link Size: 11,000 question-image pairs
Resolution : 224 x 224
Platforms: UAV-DJI Mavic Pro quadcopters, after Hurricane Harvey
Use: Remote Sensing Visual Question Answering
Change Detection-Based Visual Question Answering Dataset Link Paper Link Size: 2,968 pairs of multitemporal images and more than 122,000 question–answer pairs
Classes: 6
Resolution : 512×512 pixels
Platforms: Based on the semantic change detection dataset (SECOND)
Use: Remote Sensing Visual Question Answering
LAION-EO link Paper Link Size : 24,933 samples with 40.1% English captions as well as captions in other common languages from LAION-5B
Resolution : mean height of 633.0 pixels (up to 9,999) and mean width of 843.7 pixels (up to 19,687)
Platforms : Based on LAION-5B
CapERA: Captioning Events in Aerial Videos Link Paper Link Size : 2,864 videos and 14,320 captions, where each video is paired with five unique captions
Remote Sensing Image Captioning Dataset (RSICap) link Paper Link RSICap comprises 2,585 human-annotated captions with rich and high-quality information
This dataset offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position, etc.)
Remote Sensing Image Captioning Evaluation Dataset (RSIEval) link Paper Link 100 human-annotated captions and 936 visual question-answer pairs with rich information and open-ended questions and answers.
Can be used for Image Captioning and Visual Question-Answering tasks
Revised Remote Sensing Image Captioning Dataset (RSICD) Link Paper Link Size: 10,921 images with five captions per image
Number of Classes: 30
Resolution : 224 x 224
Platforms: Google Earth, Baidu Map, MapABC and Tianditu
Use: Remote Sensing Image Captioning
Revised University of California Merced dataset (UCM-Captions) Link Paper Link Size: 2,100 images with five captions per image
Number of Classes: 21
Resolution : 256 x 256
Platforms: USGS National Map Urban Area Imagery collection
Use: Remote Sensing Image Captioning
Revised Sydney-Captions Dataset Link Paper Link Size: 613 images with five captions per image
Number of Classes: 7
Resolution : 500 x 500
Platforms: Google Earth
Use: Remote Sensing Image Captioning
LEVIR-CC dataset Link Paper Link Size: 10,077 pairs of RS images and 50,385 corresponding sentences
Number of Classes: 10
Resolution : 1024 × 1024 pixels
Platforms: Beihang University
Use: Remote Sensing Image Captioning
NWPU-Captions dataset images_Link, info_Link Paper Link Size: 31,500 images with 157,500 sentences
Number of Classes: 45
Resolution : 256 x 256 pixels
Platforms: based on NWPU-RESISC45 dataset
Use: Remote Sensing Image Captioning
Remote sensing Image-Text Match dataset (RSITMD) Link Paper Link Size: 23,715 captions for 4,743 images
Number of Classes: 32
Resolution : 500 x 500
Platforms: RSICD and Google Earth
Use: Remote Sensing Image-Text Retrieval
PatternNet Link Paper Link Size: 30,400 images
Number of Classes: 38
Resolution : 256 x 256
Platforms: Google Earth imagery and the Google Map API
Use: Remote Sensing Image Retrieval
Dense Labeling Remote Sensing Dataset (DLRSD) Link Paper Link Size: 2,100 images
Number of Classes: 21
Resolution : 256 x 256
Platforms: Extension of the UC Merced dataset
Use: Remote Sensing Image Retrieval (RSIR), Classification and Semantic Segmentation
Dior-Remote Sensing Visual Grounding Dataset (RSVGD) Link Paper Link Size: 38,320 RS image-query pairs and 17,402 RS images
Number of Classes: 20
Resolution : 800 x 800
Platforms: DIOR dataset
Use: Remote Sensing Visual Grounding
OPT-RSVG Dataset link Paper Link Size : 25,452 images and 48,952 expressions in English and Chinese
Number of Classes : 14
Resolution : 800 x 800
Visual Grounding in Remote Sensing Images link Paper Link Size : 4,239 images including 5,994 object instances and 7,933 referring expressions
Images are 1024×1024 pixels
Platforms: multiple sensors and platforms (e.g. Google Earth)
Remote Sensing Image Scene Classification (NWPU-RESISC45) Link Paper Link Size: 31,500 images
Number of Classes: 45
Resolution : 256 x 256 pixels
Platforms: Google Earth
Use: Remote Sensing Image Scene Classification

Related Repositories & Libraries

---

Stay tuned for continuous updates and improvements! 🚀