This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey
[Paper]
Feel free to contact us or open a pull request if you find any related papers that are not included here.
Last updated on 2023/07/23
- [CVPR 2023] RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training [Paper]
- [CVPR 2023] DeAR: Debiasing Vision-Language Models with Additive Residuals [Paper]
- [CVPR 2023] Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training [Paper][Code]
- [arXiv 2023] Improving CLIP Training with Language Rewrites [Paper][Code]
- [arXiv 2023] Too Large; Data Reduction for Vision-Language Pre-Training [Paper][Code]
- [arXiv 2023] Segment Anything [Paper][Code]
- [arXiv 2023] Semantic-SAM: Segment and Recognize Anything at Any Granularity [Paper][Code]
- [arXiv 2023] Segment Everything Everywhere All at Once [Paper][Code]
- [CVPR 2023] Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [Paper][Code]
- [CVPR 2023] Learning to Name Classes for Vision and Language Models [Paper]
- [CVPR 2023] Semantic Prompt for Few-Shot Image Recognition [Paper]
- [CVPR 2023] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners [Paper][Code]
- [CVPR 2023] Task Residual for Tuning Vision-Language Models [Paper][Code]
- [arXiv 2023] ProTeCt: Prompt Tuning for Hierarchical Consistency [Paper]
- [arXiv 2023] Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification [Paper]
- [arXiv 2023] Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning [Paper][Code]
- [arXiv 2023] Fine-Grained Visual Prompting [Paper]
- [ACL 2023] Deeply Coupled Cross-Modal Prompt Learning [Paper][Code]
- [arXiv 2023] SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More [Paper][Code]
- [arXiv 2023] Segment Anything in High Quality [Paper][Code]
- [arXiv 2023] Personalize Segment Anything Model with One Shot [Paper][Code]
- [arXiv 2023] Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation [Paper]
- [CVPR 2023] Aligning Bag of Regions for Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2023] Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [Paper]
- [CVPR 2023] Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2023] CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [Paper][Code]
- [CVPR 2023] DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [Paper]
- [CVPR 2023] Detecting Everything in the Open World: Towards Universal Object Detection [Paper][Code]
- [CVPR 2023] CapDet: Unifying Dense Captioning and Open-World Detection Pretraining [Paper]
- [arXiv 2023] Contextual Object Detection with Multimodal Large Language Models [Paper][Code]
- [arXiv 2023] Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models [Paper][Code]
- [CVPR 2023] FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation [Paper][Code]
- [CVPR 2023] Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations [Paper][Code]
- [arXiv 2023] Exploring Open-Vocabulary Semantic Segmentation without Human Labels [Paper]
- [arXiv 2023] OpenVIS: Open-vocabulary Video Instance Segmentation [Paper]
- [arXiv 2023] Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation [Paper]
- [arXiv 2023] Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation [Paper][Code]
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
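The zero-shot prediction mentioned above is typically done by matching an image embedding against text embeddings of class-name prompts. Below is a minimal, illustrative sketch using the HuggingFace `transformers` CLIP interface; the checkpoint name, image path, and class names are placeholders chosen for this example and are not part of the survey.

```python
# Minimal zero-shot classification sketch with a CLIP-style VLM.
# Assumes the `transformers` and `Pillow` packages are installed;
# "example.jpg" and the class names below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in class_names]  # prompt templates as in CLIP

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, softmax-normalised over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for name, p in zip(class_names, probs.tolist()):
    print(f"{name}: {p:.3f}")
```

Because the class vocabulary is supplied as text at inference time, it can be swapped freely, which is what allows a single pre-trained VLM to be reused across recognition tasks.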
If you find our work useful in your research, please consider citing:
@article{zhang2023vision,
  title={Vision-Language Models for Vision Tasks: A Survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={arXiv preprint arXiv:2304.00685},
  year={2023}
}
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
Dataset | Year | Num of Image-Text Pairs | Language | Project |
---|---|---|---|---|
SBU Caption | 2011 | 1M | English | Project |
COCO Caption | 2016 | 1.5M | English | Project |
Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
Visual Genome | 2017 | 5.4M | English | Project |
Conceptual Captions 3M | 2018 | 3.3M | English | Project |
Localized Narratives | 2020 | 0.87M | English | Project |
Conceptual 12M | 2021 | 12M | English | Project |
Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
Red Caps | 2021 | 12M | English | Project |
LAION400M | 2021 | 400M | English | Project |
LAION5B | 2022 | 5B | Over 100 Languages | Project |
WuKong | 2022 | 100M | Chinese | Project |
CLIP | 2021 | 400M | English | - |
ALIGN | 2021 | 1.8B | English | - |
FILIP | 2021 | 300M | English | - |
WebLI | 2022 | 12B | English | - |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
Paper | Published in | Code/Project |
---|---|---|
FLAVA: A Foundational Language And Vision Alignment Model | CVPR 2022 | Code |
CoCa: Contrastive Captioners are Image-Text Foundation Models | arXiv 2022 | Code |
Too Large; Data Reduction for Vision-Language Pre-Training | arXiv 2023 | Code |
SAM: Segment Anything | arXiv 2023 | Code |
SEEM: Segment Everything Everywhere All at Once | arXiv 2023 | Code |
Semantic-SAM: Segment and Recognize Anything at Any Granularity | arXiv 2023 | Code |
Paper | Published in | Code/Project |
---|---|---|
GLIP: Grounded Language-Image Pre-training | CVPR 2022 | Code |
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | NeurIPS 2022 | - |
nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training | CVPR 2023 | Code |
Paper | Published in | Code/Project |
---|---|---|
Exploring Visual Prompts for Adapting Large-Scale Models | arXiv 2022 | Code |
Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification | arXiv 2023 | - |
Fine-Grained Visual Prompting | arXiv 2023 | - |
Paper | Published in | Code/Project |
---|---|---|
UPT: Unified Vision and Language Prompt Learning | arXiv 2022 | Code |
MVLPT: Multitask Vision-Language Prompt Tuning | arXiv 2022 | Code |
CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model | arXiv 2022 | Code |
MaPLe: Multi-modal Prompt Learning | CVPR 2023 | Code |