🎮 Awesome Remote Sensing Image-Text Retrieval | Remote Sensing Cross-Modal Retrieval | Remote Sensing Vision-Language Models

🧭 Guideline

A benchmark and curated collection of papers on Remote Sensing Image-Text Retrieval (RSITR) | Remote Sensing Cross-Modal Retrieval (RSCMR), gathered from the Internet. If there are any omissions, please contact me at jiancheng.pan.plus@gmail.com. 🤝 If you want to join the Remote Sensing Vision-Language Models (RSVLMs) community, you can join the Slack Group.

💻 News

Major news from the RSVLMs community.

2023/12/20: SkyScript, a comprehensive vision-language dataset for remote sensing images covering 29K distinct semantic tags (AAAI 2024) [link].
2023/11/24: GeoChat: Grounded Large Vision-Language Model for Remote Sensing [link].
2023/06/20: RS5M, a remote sensing dataset of 5M+ image-text pairs, released [link].
2023/06/19: RemoteCLIP, the first vision-language foundation model for remote sensing, proposed [link].

📊 Remote Sensing Captions Dataset

A collection of the more popular image-text pair datasets for remote sensing. Please get in touch if you know of others that should be added.

Dataset Name	Number of Images	Image Size (pixels)	Associated VLM
UCM-Captions	2,100	256 × 256	-
Sydney-Captions	613	500 × 500	-
RSICD	10,921	224 × 224	-
RSITMD	4,743	256 × 256	-
NWPU-Captions	31,500	256 × 256	-
RS5M	5 million+	All Resolutions	GeoRSCLIP
SkyScript	5.2 million+	All Resolutions	SkyCLIP

🆚 RSITR | RSCMR Benchmark

Contributions of additional RSITR | RSCMR methods are welcome.

📌 Cross-Modal Retrieval on RSICD:

https://paperswithcode.com/sota/cross-modal-retrieval-on-rsicd

📌 Cross-Modal Retrieval on RSITMD:

https://paperswithcode.com/sota/cross-modal-retrieval-on-rsitmd
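
Leaderboards for this task typically report Recall@K (R@K), the fraction of queries whose correct match appears among the top K retrieved items. The sketch below is a minimal illustration under that assumption, not the evaluation code of any listed method. The function name `recall_at_k`, the similarity scores, and the ground-truth sets are all hypothetical toy values.

```python
from typing import List, Sequence, Set


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ground_truth: Sequence[Set[int]],
    k: int,
) -> float:
    """Fraction of queries with at least one correct candidate in the top k.

    similarity[q][j] is the score between query q and candidate j
    (higher means more similar); ground_truth[q] holds the indices of
    the candidates that are correct for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity and keep the top k.
        top_k: List[int] = sorted(
            range(len(scores)), key=lambda j: scores[j], reverse=True
        )[:k]
        if any(j in ground_truth[q] for j in top_k):
            hits += 1
    return hits / len(similarity)


# Toy text-to-image example: 3 text queries scored against 4 images.
# These numbers are made up purely for illustration.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.5],
    [0.1, 0.3, 0.7, 0.6],
]
ground_truth = [{0}, {1}, {3}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(similarity, ground_truth, k):.3f}")
```

Running the same function with queries and candidates swapped gives the image-to-text direction. Averaging the R@K values over both directions gives a single summary score.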

📖 RSITR | RSCMR Method

Closed-Domain Method: Training and testing on a single dataset.

Open-Domain Method: Using extra datasets for pre-training to gain more inter-domain knowledge.

Hashing Method: Encoding images and text as compact binary codes so that retrieval on large-scale datasets becomes efficient.
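
To make the hashing category concrete, here is a minimal sketch of retrieval with binary codes. It assumes the simplest possible scheme: each real-valued embedding is binarized by sign, and candidates are ranked by Hamming distance. The papers listed below learn their hash functions, so this is not their method. The helper names and the toy embeddings are hypothetical.

```python
from typing import List, Sequence


def binarize(embedding: Sequence[float]) -> int:
    """Pack the sign pattern of an embedding into an integer bit code."""
    code = 0
    for value in embedding:
        # Positive components become 1 bits, all others 0 bits.
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code: int, database_codes: Sequence[int]) -> List[int]:
    """Return database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


# Toy 4-dimensional embeddings, made up for illustration only.
image_embeddings = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.6, -0.4, -0.3, -0.9],
]
text_embedding = [0.8, -0.1, 0.5, -0.6]

image_codes = [binarize(e) for e in image_embeddings]
query_code = binarize(text_embedding)
ranking = rank_by_hamming(query_code, image_codes)
print(ranking)  # [0, 2, 1]
```

Comparing codes needs only an XOR and a bit count, which is why binary codes scale to large archives more cheaply than floating-point similarity search.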

Open-Domain Method

[AAAI 2024] | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | [paper] [github]
- Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, Ram Rajagopal.
[ArXiv 2023] | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | [paper] [github]
- Fan Liu, Delong Chen, Zhan-Rong Guan, Xiaocong Zhou, Jiale Zhu, Jun Zhou.
[ArXiv 2023] | RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | [paper] [github]
- Zilun Zhang, Tiancheng Zhao, Yulong Guo, Jianwei Yin.
[ArXiv 2023] | RSGPT: A Remote Sensing Vision Language Model and Benchmark | [paper]
- Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li.
[TGRS 2023] | Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval | [paper]
- Yuan Yuan, Yangfan Zhan, Zhitong Xiong.

Closed-Domain Method

[ACMMM 2023] | A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | [paper] [github]
- Jiancheng Pan, Qing Ma, Cong Bai.
[ArXiv 2023] | Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval | [paper]
- Jiancheng Pan, Qing Ma, Cong Bai.
[Sensors 2023] | A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval | [paper]
- Fuzhong Zheng, Xu Wang, Luyao Wang, Xiong Zhang, Hongze Zhu, Long Wang, Haisu Zhang.
[Remote Sensing 2023] | A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing | [paper]
- Xiong Zhang, Weipeng Li, Xu Wang, Luyao Wang, Fuzhong Zheng, Long Wang, Haisu Zhang.
[IGARSS 2023] | A Texture and Saliency Enhanced Image Learning Method For Cross-Modal Remote Sensing Image-Text Retrieval | [paper]
- Rui Yang, Di Zhang, Yanhe Guo, Shuang Wang.
[IGARSS 2023] | A Fast and Accurate Method for Remote Sensing Image-Text Retrieval Based On Large Model Knowledge Distillation | [paper]
- Yu Liao, Rui Yang, Tao Xie, Hantong Xing, Dou Quan, Shuang Wang, B. Hou.
[TGRS 2023] | Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval | [paper]
- Zhong Ji, Changxu Meng, Yan Zhang, Yanwei Pang, Xuelong Li.
[Mathematics 2023] | An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval | [paper]
- Liu He, Shuyan Liu, Ran An, Yudong Zhuo, Jian Tao.
[TGRS 2023] | Hypersphere-based Remote Sensing Cross-Modal Text-Image Retrieval via Curriculum Learning | [paper]
- Weihang Zhang, Jihao Li, Shuoke Li, Jialiang Chen, Wenkai Zhang, Xin Gao, Xian Sun.
[TGRS 2023] | Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval | [paper]
- Xu Tang, Yijing Wang, Jingjing Ma, Xiangrong Zhang, F. Liu, Licheng Jiao.
[ICMR 2023] | Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval | [paper] [github]
- Jiancheng Pan, Qing Ma, Cong Bai.
[CDCEO 2022] | Knowledge-Aware Cross-Modal Text-Image Retrieval for Remote Sensing Images | [paper]
- Li Mi, Siran Li, Christel Chappuis, D. Tuia.
[IGARSS 2022] | A transformer-based cross-modal image-text retrieval method using feature decoupling and reconstruction | [paper]
- Huan Zhang, Yingzhi Sun, Yu Liao, Siyuan Xu, R. Yang, Shuang Wang, B. Hou, Licheng Jiao.
[INT J APPL EARTH OBS 2022] | MCRN: A Multi-source Cross-modal Retrieval Network for remote sensing | [paper]
- Zhiqiang Yuan, Wenkai Zhang, Changyuan Tian, Yongqiang Mao, Ruixue Zhou, Hongqi Wang, K. Fu, Xian Sun.
[JSTARS 2022] | Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval | [paper]
- Mohamad Mahmoud Al Rahhal, Y. Bazi, Norah A. Alsharif, Laila Bashmal, N. Alajlan, F. Melgani.
[Applied Sciences 2022] | Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval | [paper]
- Mohamad Mahmoud Al Rahhal, M. Bencherif, Y. Bazi, Abdullah Alharbi, M. L. Mekhalfi.
[TGRS 2022] | Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information | [paper] [github]
- Zhiqiang Yuan, Wenkai Zhang, Changyuan Tian, Xuee Rong, Zhengyuan Zhang, Hongqi Wang, K. Fu, Xian Sun.
[TGRS 2021] | A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing | [paper]
- Zhiqiang Yuan, Wenkai Zhang, Xuee Rong, Xuan Li, Jialiang Chen, Hongqi Wang, K. Fu, Xian Sun.
[TGRS 2021] | Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval | [paper] [github]
- Zhiqiang Yuan, Wenkai Zhang, K. Fu, Xuan Li, Chubo Deng, Hongqi Wang, Xian Sun.
[JSTARS 2021] | A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing | [paper]
- Qimin Cheng, Yuzhuo Zhou, Peng Fu, Yuan Xu, Liang Zhang.
[LGRS 2021] | Fusion-Based Correlation Learning Model for Cross-Modal Remote Sensing Image Retrieval | [paper]
- Yafei Lv, Wei Xiong, Xiaohan Zhang, Yaqi Cui.
[Remote Sensing 2020] | TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images | [paper]
- T. M. Ali, Y. Bazi, Mohamad Mahmoud Al Rahhal, M. L. Mekhalfi, Lalitha Rangarajan, M. Zuair.

Hashing Method

[JSTARS 2022] | Remote Sensing Cross-Modal Retrieval by Deep Image-Voice Hashing | [paper]
- Yichao Zhang, Xiangtao Zheng, Xiaoqiang Lu.
[ArXiv 2022] | Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing | [paper]
- Georgii Mikriukov, Mahdyar Ravanbakhsh, Begüm Demir.
[ICIP 2022] | An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing | [paper]
- Georgii Mikriukov, Mahdyar Ravanbakhsh, Begüm Demir.

jaychempan/Awesome-RSITR