Vision Language Models for Vision Tasks: A Survey
This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey
[Paper]
Feel free to contact us or open a pull request if you find any related papers that are not included here.
News
Last update on 2023/8/21
VLM Pre-training Methods
VLM Transfer Learning Methods
- [ICCV 2023] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models [Paper][Code]
- [ICCV 2023] Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? [Paper][Code]
- [ICCV 2023] PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization [Paper][Code]
- [ICCV 2023] Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models [Paper]
- [ICCV 2023] PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation [Paper]
- [ICCVW 2023] AD-CLIP: Adapting Domains in Prompt Space Using CLIP [Paper]
VLM Knowledge Distillation for Detection
- [arXiv 2023] Improving Pseudo Labels for Open-Vocabulary Object Detection [Paper]
VLM Knowledge Distillation for Segmentation
- [ICCV 2023] SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning [Paper][Code]
- [arXiv 2023] ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation [Paper]
- [arXiv 2023] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [Paper][Code]
Abstract
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently: they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on a variety of visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, covering: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
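To make the zero-shot paradigm above concrete, the sketch below classifies an image with a pre-trained CLIP model by comparing its embedding against text prompts built from class names. It assumes OpenAI's open-source `clip` package (`pip install git+https://github.com/openai/CLIP.git`); the label set and image path are placeholders, not anything prescribed by the survey.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Class names are wrapped into natural-language prompts ("prompt engineering").
class_names = ["cat", "dog", "car"]  # placeholder label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt acts as logits.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```

No task-specific training is involved: swapping in a different label set immediately yields a classifier for a different task, which is what makes a single VLM reusable across recognition tasks.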
Citation
If you find our work useful in your research, please consider citing:
```bibtex
@article{zhang2023vision,
  title={Vision-Language Models for Vision Tasks: A Survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={arXiv preprint arXiv:2304.00685},
  year={2023}
}
```
Menu
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
Datasets
Datasets for VLM Pre-training
Dataset | Year | Number of Image-Text Pairs | Language | Project |
---|---|---|---|---|
SBU Caption | 2011 | 1M | English | Project |
COCO Caption | 2016 | 1.5M | English | Project |
Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
Visual Genome | 2017 | 5.4M | English | Project |
Conceptual Captions 3M | 2018 | 3.3M | English | Project |
Localized Narratives | 2020 | 0.87M | English | Project |
Conceptual 12M | 2021 | 12M | English | Project |
Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
Red Caps | 2021 | 12M | English | Project |
LAION400M | 2021 | 400M | English | Project |
LAION5B | 2022 | 5B | Over 100 Languages | Project |
WuKong | 2022 | 100M | Chinese | Project |
CLIP | 2021 | 400M | English | - |
ALIGN | 2021 | 1.8B | English | - |
FILIP | 2021 | 300M | English | - |
WebLI | 2022 | 12B | English | - |
Datasets for VLM Evaluation
Image Classification
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |
Image-Text Retrieval
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |
Action Recognition
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |
Object Detection
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Semantic Segmentation
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
Vision-Language Pre-training Methods
Pre-training with Contrastive Objective
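Contrastive pre-training methods such as CLIP learn by pulling the embeddings of matched image-text pairs together and pushing mismatched pairs apart, typically with a symmetric InfoNCE loss. Below is a minimal sketch of that objective, assuming batches of L2-normalized image and text embeddings; it illustrates the general recipe rather than any single paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text InfoNCE loss over a batch of N paired embeddings.

    Both inputs are assumed L2-normalized, shape (N, D); pair i is matched.
    """
    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = image_feats @ text_feats.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```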
Pre-training with Generative Objective
Paper | Published in | Code/Project |
---|---|---|
FLAVA: A Foundational Language And Vision Alignment Model | CVPR 2022 | Code |
CoCa: Contrastive Captioners are Image-Text Foundation Models | arXiv 2022 | Code |
Too Large; Data Reduction for Vision-Language Pre-Training | arXiv 2023 | Code |
SAM: Segment Anything | arXiv 2023 | Code |
SEEM: Segment Everything Everywhere All at Once | arXiv 2023 | Code |
Semantic-SAM: Segment and Recognize Anything at Any Granularity | arXiv 2023 | Code |
Pre-training with Alignment Objective
Paper | Published in | Code/Project |
---|---|---|
GLIP: Grounded Language-Image Pre-training | CVPR 2022 | Code |
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | NeurIPS 2022 | - |
nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training | CVPR 2023 | Code |
Vision-Language Model Transfer Learning Methods
Transfer with Prompt Tuning
Transfer with Text Prompt Tuning
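Text prompt tuning methods such as CoOp replace hand-crafted templates (e.g., "a photo of a [CLASS]") with a few learnable context vectors that are optimized on the downstream task while the VLM stays frozen. The following is a minimal sketch of that idea; the dimensions and the surrounding frozen text encoder are illustrative assumptions, not any paper's exact implementation.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, class_embeddings, n_ctx=16, dim=512):
        super().__init__()
        # Shared context vectors, randomly initialized and learned end-to-end.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen token embeddings of each class name: (n_classes, n_name_tokens, dim).
        self.register_buffer("class_embeddings", class_embeddings)

    def forward(self):
        n_classes = self.class_embeddings.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Prompt per class: [learnable context tokens][frozen class-name tokens].
        return torch.cat([ctx, self.class_embeddings], dim=1)
```

Only `self.ctx` receives gradients; the resulting prompts are fed through the frozen text encoder to produce per-class classifier weights.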
Transfer with Visual Prompt Tuning
Paper | Published in | Code/Project |
---|---|---|
Exploring Visual Prompts for Adapting Large-Scale Models | arXiv 2022 | Code |
Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification | arXiv 2023 | - |
Fine-Grained Visual Prompting | arXiv 2023 | - |
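Visual prompt tuning instead perturbs the input on the vision side, for example by learning a pixel-space padding frame that is added to every image while the VLM stays frozen, in the spirit of "Exploring Visual Prompts for Adapting Large-Scale Models". A minimal sketch under assumed image sizes:

```python
import torch
import torch.nn as nn

class PaddingPrompt(nn.Module):
    def __init__(self, image_size=224, pad=30):
        super().__init__()
        # Learnable values over the whole frame; a fixed mask keeps only the border.
        self.prompt = nn.Parameter(torch.zeros(3, image_size, image_size))
        mask = torch.ones(1, image_size, image_size)
        mask[:, pad:-pad, pad:-pad] = 0  # interior pixels stay untouched
        self.register_buffer("mask", mask)

    def forward(self, x):  # x: (B, 3, H, W) with H = W = image_size
        return x + self.prompt * self.mask
```

During transfer, only the prompt parameters are optimized; the frozen VLM image encoder simply consumes the prompted images.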
Transfer with Text and Visual Prompt Tuning
Paper | Published in | Code/Project |
---|---|---|
UPT: Unified Vision and Language Prompt Learning | arXiv 2022 | Code |
MVLPT: Multitask Vision-Language Prompt Tuning | arXiv 2022 | Code |
CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model | arXiv 2022 | Code |
MaPLe: Multi-modal Prompt Learning | CVPR 2023 | Code |
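Joint text-and-visual prompt tuning optimizes prompts in both modalities together; MaPLe, for instance, couples them by deriving the vision-side prompts from the text-side context through a learned projection. The sketch below illustrates that coupling idea only; the shapes and the projection are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class CoupledPrompt(nn.Module):
    """Learnable text context plus a linear coupling that derives visual
    prompt tokens from it, so both branches are tuned jointly."""
    def __init__(self, n_ctx=4, text_dim=512, vision_dim=768):
        super().__init__()
        self.text_ctx = nn.Parameter(torch.randn(n_ctx, text_dim) * 0.02)
        self.proj = nn.Linear(text_dim, vision_dim)  # couples the two branches

    def forward(self):
        # Text context goes to the frozen text encoder; the projected tokens
        # are prepended to the frozen image encoder's token sequence.
        visual_ctx = self.proj(self.text_ctx)  # (n_ctx, vision_dim)
        return self.text_ctx, visual_ctx
```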