This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey
[Paper]
Feel free to contact us or open a pull request if you find any related papers that are not included here.
Last updated on 2023/07/23
- [CVPR 2023] RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training [Paper]
- [CVPR 2023] DeAR: Debiasing Vision-Language Models with Additive Residuals [Paper]
- [CVPR 2023] Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training [Paper][Code]
- [arXiv 2023] Improving CLIP Training with Language Rewrites [Paper][Code]
- [arXiv 2023] Too Large; Data Reduction for Vision-Language Pre-Training [Paper][Code]
- [arXiv 2023] Segment Anything [Paper][Code]
- [arXiv 2023] Semantic-SAM: Segment and Recognize Anything at Any Granularity [Paper][Code]
- [arXiv 2023] Segment Everything Everywhere All at Once [Paper][Code]
- [CVPR 2023] Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [Paper][Code]
- [CVPR 2023] Learning to Name Classes for Vision and Language Models [Paper]
- [CVPR 2023] Semantic Prompt for Few-Shot Image Recognition [Paper]
- [CVPR 2023] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners [Paper][Code]
- [CVPR 2023] Task Residual for Tuning Vision-Language Models [Paper][Code]
- [arXiv 2023] ProTeCt: Prompt Tuning for Hierarchical Consistency [Paper]
- [arXiv 2023] Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification [Paper]
- [arXiv 2023] Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning [Paper][Code]
- [arXiv 2023] Fine-Grained Visual Prompting [Paper]
- [ACL 2023] Deeply Coupled Cross-Modal Prompt Learning [Paper][Code]
- [arXiv 2023] SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More [Paper][Code]
- [arXiv 2023] Segment Anything in High Quality [Paper][Code]
- [arXiv 2023] Personalize Segment Anything Model with One Shot [Paper][Code]
- [arXiv 2023] Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation [Paper]
- [CVPR 2023] Aligning Bag of Regions for Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2023] Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [Paper]
- [CVPR 2023] Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2023] CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [Paper][Code]
- [CVPR 2023] DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [Paper]
- [CVPR 2023] Detecting Everything in the Open World: Towards Universal Object Detection [Paper][Code]
- [CVPR 2023] CapDet: Unifying Dense Captioning and Open-World Detection Pretraining [Paper]
- [arXiv 2023] Contextual Object Detection with Multimodal Large Language Models [Paper][Code]
- [arXiv 2023] Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models [Paper][Code]
- [CVPR 2023] FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation [Paper][Code]
- [CVPR 2023] Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations [Paper][Code]
- [arXiv 2023] Exploring Open-Vocabulary Semantic Segmentation without Human Labels [Paper]
- [arXiv 2023] OpenVIS: Open-vocabulary Video Instance Segmentation [Paper]
- [arXiv 2023] Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation [Paper]
- [arXiv 2023] Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation [Paper][Code]
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
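The zero-shot prediction mentioned above is typically done by matching an image embedding against text embeddings of class-name prompts. Below is a minimal, illustrative sketch using the HuggingFace `transformers` CLIP interface; the checkpoint name, image path, and class names are placeholders chosen for this example and are not part of the survey.

```python
# Minimal zero-shot classification sketch with a CLIP-style VLM.
# Assumes the `transformers` and `Pillow` packages are installed;
# "example.jpg" and the class names below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in class_names]  # prompt templates as in CLIP

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, softmax-normalised over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for name, p in zip(class_names, probs.tolist()):
    print(f"{name}: {p:.3f}")
```

Because the class vocabulary is supplied as text at inference time, it can be swapped freely, which is what allows a single pre-trained VLM to be reused across recognition tasks.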
If you find our work useful in your research, please consider citing:
@article{zhang2023vision,
  title={Vision-Language Models for Vision Tasks: A Survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={arXiv preprint arXiv:2304.00685},
  year={2023}
}
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
Dataset | Year | Num of Image-Text Pairs | Language | Project |
---|---|---|---|---|
SBU Caption | 2011 | 1M | English | Project |
COCO Caption | 2016 | 1.5M | English | Project |
Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
Visual Genome | 2017 | 5.4M | English | Project |
Conceptual Captions 3M | 2018 | 3.3M | English | Project |
Localized Narratives | 2020 | 0.87M | English | Project |
Conceptual 12M | 2021 | 12M | English | Project |
Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
Red Caps | 2021 | 12M | English | Project |
LAION400M | 2021 | 400M | English | Project |
LAION5B | 2022 | 5B | Over 100 Languages | Project |
WuKong | 2022 | 100M | Chinese | Project |
CLIP | 2021 | 400M | English | - |
ALIGN | 2021 | 1.8B | English | - |
FILIP | 2021 | 300M | English | - |
WebLI | 2022 | 12B | English | - |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
Paper | Published in | Code/Project |
---|---|---|
FLAVA: A Foundational Language And Vision Alignment Model | CVPR 2022 | Code |
CoCa: Contrastive Captioners are Image-Text Foundation Models | arXiv 2022 | Code |
Too Large; Data Reduction for Vision-Language Pre-Training | arXiv 2023 | Code |
SAM: Segment Anything | arXiv 2023 | Code |
SEEM: Segment Everything Everywhere All at Once | arXiv 2023 | Code |
Semantic-SAM: Segment and Recognize Anything at Any Granularity | arXiv 2023 | Code |
Paper | Published in | Code/Project |
---|---|---|
GLIP: Grounded Language-Image Pre-training | CVPR 2022 | Code |
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | NeurIPS 2022 | - |
nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training | CVPR 2023 | Code |
Paper | Published in | Code/Project |
---|---|---|
Exploring Visual Prompts for Adapting Large-Scale Models | arXiv 2022 | Code |
Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification | arXiv 2023 | - |
Fine-Grained Visual Prompting | arXiv 2023 | - |
Paper | Published in | Code/Project |
---|---|---|
UPT: Unified Vision and Language Prompt Learning | arXiv 2022 | Code |
MVLPT: Multitask Vision-Language Prompt Tuning | arXiv 2022 | Code |
CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model | arXiv 2022 | Code |
MaPLe: Multi-modal Prompt Learning | CVPR 2023 | Code |