Vision-Language Models for Vision Tasks: A Survey

This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies in various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:

Vision-Language Models for Vision Tasks: A Survey
[Paper]

Feel free to contact us or open a pull request if you find any related papers that are not included here.

News

Last update on 2023/12/03

VLM Pre-training Methods

  • [ICCV 2023] ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [Paper][Code]
  • [ICCV 2023] GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training [Paper]

VLM Transfer Learning Methods

  • [ICCV 2023] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models [Paper][Code]
  • [ICCV 2023] Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? [Paper][Code]
  • [ICCV 2023] PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization [Paper][Code]
  • [ICCV 2023] Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models [Paper]
  • [ICCV 2023] PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation [Paper]
  • [ICCV 2023] Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [Paper]
  • [ICCV 2023] Read-only Prompt Optimization for Vision-Language Few-shot Learning [Paper][Code]
  • [ICCV 2023] Bayesian Prompt Learning for Image-Language Model Generalization [Paper][Code]
  • [ICCV 2023] LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models [Paper][Code]
  • [ICCV 2023] Distribution-Aware Prompt Tuning for Vision-Language Models [Paper][Code]
  • [ICCV 2023] Black Box Few-Shot Adaptation for Vision-Language models [Paper][Code]
  • [ICCVW 2023] AD-CLIP: Adapting Domains in Prompt Space Using CLIP [Paper]
  • [ICLR 2023] LPT: Long-Tailed Prompt Tuning For Image Classification [Paper][Code]
  • [arXiv 2023] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [Paper][Code]
  • [arXiv 2023] Language Models as Black-Box Optimizers for Vision-Language Models [Paper]
  • [arXiv 2023] HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding [Paper][Code]
  • [arXiv 2023] CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models [Paper]
  • [arXiv 2023] Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models [Paper][Code]

VLM Knowledge Distillation for Detection

  • [ICCV 2023] EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [Paper][Code]
  • [arXiv 2023] Improving Pseudo Labels for Open-Vocabulary Object Detection [Paper]

VLM Knowledge Distillation for Segmentation

  • [ICCV 2023] SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning [Paper][Code]
  • [arXiv 2023] ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation [Paper]
  • [arXiv 2023] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [Paper][Code]
  • [arXiv 2023] Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models [Paper]

VLM Knowledge Distillation for Other Vision Tasks

  • [arXiv 2023] Controlling Vision-Language Models for Universal Image Restoration [Paper][Code]

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. They learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs that summarize the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
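As a rough illustration of the zero-shot prediction mentioned in the abstract, the sketch below scores one image embedding against one text embedding per class name and picks the closest match. It is a minimal sketch under stated assumptions, not the method of any particular paper listed here: the function name `zero_shot_predict`, the class names, and the toy 3-dimensional embeddings are all hypothetical, and in practice the embeddings would come from a pre-trained VLM's image and text encoders.

```python
import numpy as np

def zero_shot_predict(image_emb, text_embs):
    """Return the index of the best-matching class prompt and the class scores.

    image_emb: (d,) embedding of one image (assumed to come from a VLM image encoder).
    text_embs: (num_classes, d) embeddings of one text prompt per class.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # One similarity score per class.
    logits = text_embs @ image_emb

    # Softmax over classes (shifted by the max for numerical stability).
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs

# Toy values standing in for real encoder outputs (hypothetical).
class_names = ["cat", "dog", "car"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.2, 0.9, 0.1])

idx, probs = zero_shot_predict(image_emb, text_embs)
print(class_names[idx])  # prints "dog"
```

No task-specific training happens in this sketch: changing the set of class names only changes the rows of `text_embs`, which is why a single VLM can be reused across recognition tasks.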

Citation

If you find our work useful in your research, please consider citing:

@article{zhang2023vision,
  title={Vision-Language Models for Vision Tasks: A Survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={arXiv preprint arXiv:2304.00685},
  year={2023}
}

Datasets

Datasets for VLM Pre-training

Dataset Year Num of Image-Text Pairs Language Project
SBU Caption 2011 1M English Project
COCO Caption 2016 1.5M English Project
Yahoo Flickr Creative Commons 100 Million 2016 100M English Project
Visual Genome 2017 5.4M English Project
Conceptual Captions 3M 2018 3.3M English Project
Localized Narratives 2020 0.87M English Project
Conceptual 12M 2021 12M English Project
Wikipedia-based Image Text 2021 37.6M 108 Languages Project
Red Caps 2021 12M English Project
LAION400M 2021 400M English Project
LAION5B 2022 5B Over 100 Languages Project
WuKong 2022 100M Chinese Project
CLIP 2021 400M English -
ALIGN 2021 1.8B English -
FILIP 2021 300M English -
WebLI 2022 12B English -

Datasets for VLM Evaluation

Image Classification

Dataset Year Classes Training Testing Evaluation Metric Project
MNIST 1998 10 60,000 10,000 Accuracy Project
Caltech-101 2004 102 3,060 6,085 Mean Per Class Project
PASCAL VOC 2007 2007 20 5,011 4,952 11-point mAP Project
Oxford 102 Flowers 2008 102 2,040 6,149 Mean Per Class Project
CIFAR-10 2009 10 50,000 10,000 Accuracy Project
CIFAR-100 2009 100 50,000 10,000 Accuracy Project
ImageNet-1k 2009 1000 1,281,167 50,000 Accuracy Project
SUN397 2010 397 19,850 19,850 Accuracy Project
SVHN 2011 10 73,257 26,032 Accuracy Project
STL-10 2011 10 1,000 8,000 Accuracy Project
GTSRB 2011 43 26,640 12,630 Accuracy Project
KITTI Distance 2012 4 6,770 711 Accuracy Project
IIIT5k 2012 36 2,000 3,000 Accuracy Project
Oxford-IIIT PETS 2012 37 3,680 3,669 Mean Per Class Project
Stanford Cars 2013 196 8,144 8,041 Accuracy Project
FGVC Aircraft 2013 100 6,667 3,333 Mean Per Class Project
Facial Emotion 2013 8 32,140 3,574 Accuracy Project
Rendered SST2 2013 2 7,792 1,821 Accuracy Project
Describable Textures 2014 47 3,760 1,880 Accuracy Project
Food-101 2014 101 75,750 25,250 Accuracy Project
Birdsnap 2014 500 42,283 2,149 Accuracy Project
RESISC45 2017 45 3,150 25,200 Accuracy Project
CLEVR Counts 2017 8 2,000 500 Accuracy Project
PatchCamelyon 2018 2 294,912 32,768 Accuracy Project
EuroSAT 2019 10 10,000 5,000 Accuracy Project
Hateful Memes 2020 2 8,500 500 ROC AUC Project
Country211 2021 211 43,200 21,100 Accuracy Project

Image-Text Retrieval

Dataset Year Classes Training Testing Evaluation Metric Project
Flickr30k 2014 - 31,783 - Recall Project
COCO Caption 2015 - 82,783 5,000 Recall Project

Action Recognition

Dataset Year Classes Training Testing Evaluation Metric Project
UCF101 2012 101 9,537 1,794 Accuracy Project
Kinetics700 2019 700 494,801 31,669 Mean (top1, top5) Project
RareAct 2020 122 7,607 - mWAP, mSAP Project

Object Detection

Dataset Year Classes Training Testing Evaluation Metric Project
COCO 2014 Detection 2014 80 83,000 41,000 Box mAP Project
COCO 2017 Detection 2017 80 118,000 5,000 Box mAP Project
LVIS 2019 1203 118,000 5,000 Box mAP Project
ODinW 2022 314 132,413 20,070 Box mAP Project

Semantic Segmentation

Dataset Year Classes Training Testing Evaluation Metric Project
PASCAL VOC 2012 2012 20 1,464 1,449 mIoU Project
PASCAL Context 2014 459 4,998 5,105 mIoU Project
Cityscapes 2016 19 2,975 500 mIoU Project
ADE20k 2017 150 25,574 2,000 mIoU Project

Vision-Language Pre-training Methods

Pre-training with Contrastive Objective

Paper Published in Code/Project
CLIP: Learning Transferable Visual Models From Natural Language Supervision ICML 2021 Code
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision ICML 2021 -
OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation arXiv 2021 Code
Florence: A New Foundation Model for Computer Vision arXiv 2021 -
RegionCLIP: Region-based Language-Image Pretraining arXiv 2021 Code
DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm ICLR 2022 Code
FILIP: Fine-grained Interactive Language-Image Pre-Training ICLR 2022 -
KELIP: Large-scale Bilingual Language-Image Contrastive Learning ICLRW 2022 Code
ZeroVL: Contrastive Vision-Language Pre-training with Limited Resources ECCV 2022 Code
SLIP: Self-supervision meets Language-Image Pre-training ECCV 2022 Code
UniCL: Unified Contrastive Learning in Image-Text-Label Space CVPR 2022 Code
LiT: Zero-Shot Transfer with Locked-image text Tuning CVPR 2022 Code
GroupViT: Semantic Segmentation Emerges from Text Supervision CVPR 2022 Code
PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining NeurIPS 2022 -
UniCLIP: Unified Framework for Contrastive Language-Image Pre-training NeurIPS 2022 -
K-LITE: Learning Transferable Visual Models with External Knowledge NeurIPS 2022 Code
FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone NeurIPS 2022 Code
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese arXiv 2022 Code
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities arXiv 2022 Code
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation arXiv 2022 Code
CLIPpy: Perceptual Grouping in Contrastive Vision-Language Models ICCV 2023 -
NLIP: Noise-robust Language-Image Pre-training AAAI 2023 -
PaLI: A Jointly-Scaled Multilingual Language-Image Model ICLR 2023 Project
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention ICLR 2023 Code
CLIPPO: Image-and-Language Understanding from Pixels Only CVPR 2023 Code
RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training CVPR 2023 -
DeAR: Debiasing Vision-Language Models with Additive Residuals CVPR 2023 -
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training CVPR 2023 Code
LaCLIP: Improving CLIP Training with Language Rewrites NeurIPS 2023 Code

Pre-training with Generative Objective

Paper Published in Code/Project
FLAVA: A Foundational Language And Vision Alignment Model CVPR 2022 Code
CoCa: Contrastive Captioners are Image-Text Foundation Models arXiv 2022 Code
Too Large; Data Reduction for Vision-Language Pre-Training arXiv 2023 Code
SAM: Segment Anything arXiv 2023 Code
SEEM: Segment Everything Everywhere All at Once arXiv 2023 Code
Semantic-SAM: Segment and Recognize Anything at Any Granularity arXiv 2023 Code

Pre-training with Alignment Objective

Paper Published in Code/Project
GLIP: Grounded Language-Image Pre-training CVPR 2022 Code
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection NeurIPS 2022 -
nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training CVPR 2023 Code

Vision-Language Model Transfer Learning Methods

Transfer with Prompt Tuning

Transfer with Text Prompt Tuning

Paper Published in Code/Project
CoOp: Learning to Prompt for Vision-Language Models IJCV 2022 Code
CoCoOp: Conditional Prompt Learning for Vision-Language Models CVPR 2022 Code
ProDA: Prompt Distribution Learning CVPR 2022 -
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting CVPR 2022 Code
TPT: Test-time prompt tuning for zero-shot generalization in vision-language models NeurIPS 2022 Code
DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations NeurIPS 2022 Code
CPL: Counterfactual Prompt Learning for Vision and Language Models EMNLP 2022 Code
Bayesian Prompt Learning for Image-Language Model Generalization arXiv 2022 -
UPL: Unsupervised Prompt Learning for Vision-Language Models arXiv 2022 Code
ProGrad: Prompt-aligned Gradient for Prompt Tuning arXiv 2022 Code
SoftCPT: Prompt Tuning with Soft Context Sharing for Vision-Language Models arXiv 2022 Code
SubPT: Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models TCSVT 2023 Code
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models CVPR 2023 Code
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models ICLR 2023 Code
LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition arXiv 2023 Code
Texts as Images in Prompt Tuning for Multi-Label Image Recognition CVPR 2023 Code
Visual-Language Prompt Tuning with Knowledge-guided Context Optimization CVPR 2023 Code
Learning to Name Classes for Vision and Language Models CVPR 2023 -
CuPL: What does a platypus look like? Generating customized prompts for zero-shot image classification ICCV 2023 Code
ProTeCt: Prompt Tuning for Hierarchical Consistency arXiv 2023 -
Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning arXiv 2023 Code

Transfer with Visual Prompt Tuning

Paper Published in Code/Project
Exploring Visual Prompts for Adapting Large-Scale Models arXiv 2022 Code
Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification arXiv 2023 -
Fine-Grained Visual Prompting arXiv 2023 -

Transfer with Text and Visual Prompt Tuning

Paper Published in Code/Project
UPT: Unified Vision and Language Prompt Learning arXiv 2022 Code
MVLPT: Multitask Vision-Language Prompt Tuning arXiv 2022 Code
CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model arXiv 2022 Code
MaPLe: Multi-modal Prompt Learning CVPR 2023 Code

Transfer with Feature Adapter

Paper Published in Code/Project
Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification ECCV 2022 Code
SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models BMVC 2022 Code
CLIP-Adapter: Better Vision-Language Models with Feature Adapters arXiv 2021 Code
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models ICCV 2023 Code
CLIPPR: Improving Zero-Shot Models with Label Distribution Priors arXiv 2022 Code
SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification arXiv 2022 -
SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More arXiv 2023 Code
Segment Anything in High Quality arXiv 2023 Code
HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding arXiv 2023 Code
CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models arXiv 2023 -

Transfer with Other Methods

Paper Published in Code/Project
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts arXiv 2021 -
Wise-FT: Robust fine-tuning of zero-shot models CVPR 2022 Code
MaskCLIP: Extract Free Dense Labels from CLIP ECCV 2022 Code
MUST: Masked Unsupervised Self-training for Label-free Image Classification ICLR 2023 Code
CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention AAAI 2023 Code
Semantic Prompt for Few-Shot Image Recognition CVPR 2023 -
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners CVPR 2023 Code
Task Residual for Tuning Vision-Language Models CVPR 2023 Code
Deeply Coupled Cross-Modal Prompt Learning ACL 2023 Code
Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation arXiv 2023 -
Personalize Segment Anything Model with One Shot arXiv 2023 Code
CHiLS: Zero-shot image classification with hierarchical label sets ICML 2023 Code
Improving Zero-shot Generalization and Robustness of Multi-modal Models CVPR 2023 Code
Exploiting Category Names for Few-Shot Classification with Vision-Language Models ICLRW 2023 -
Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models arXiv 2023 Code

Vision-Language Model Knowledge Distillation Methods

Knowledge Distillation for Object Detection

Paper Published in Code/Project
ViLD: Open-vocabulary Object Detection via Vision and Language Knowledge Distillation ICLR 2022 Code
DetPro: Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model CVPR 2022 Code
XPM: Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling CVPR 2022 Code
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection NeurIPS 2022 Code
PromptDet: Towards Open-vocabulary Detection using Uncurated Images ECCV 2022 Code
PB-OVD: Open Vocabulary Object Detection with Pseudo Bounding-Box Labels ECCV 2022 Code
OV-DETR: Open-Vocabulary DETR with Conditional Matching ECCV 2022 Code
Detic: Detecting Twenty-thousand Classes using Image-level Supervision ECCV 2022 Code
OWL-ViT: Simple Open-Vocabulary Object Detection with Vision Transformers ECCV 2022 Code
VL-PLM: Exploiting Unlabeled Data with Vision and Language Models for Object Detection ECCV 2022 Code
ZSD-YOLO: Zero-shot Object Detection Through Vision-Language Embedding Alignment arXiv 2022 Code
HierKD: Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation arXiv 2022 Code
VLDet: Learning Object-Language Alignments for Open-Vocabulary Object Detection ICLR 2023 Code
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models ICLR 2023 Code
CondHead: Learning to Detect and Segment for Open Vocabulary Object Detection CVPR 2023 -
Aligning Bag of Regions for Open-Vocabulary Object Detection CVPR 2023 Code
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers CVPR 2023 Code
Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection CVPR 2023 Code
CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching CVPR 2023 Code
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment CVPR 2023 -
Detecting Everything in the Open World: Towards Universal Object Detection CVPR 2023 Code
CapDet: Unifying Dense Captioning and Open-World Detection Pretraining CVPR 2023 -
Contextual Object Detection with Multimodal Large Language Models arXiv 2023 Code
Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models arXiv 2023 Code

Knowledge Distillation for Semantic Segmentation

Paper Published in Code/Project
SSIW: Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples arXiv 2021 -
ReCo: Retrieve and Co-segment for Zero-shot Transfer NeurIPS 2022 Code
CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation CVPR 2022 Code
CLIPSeg: Image Segmentation Using Text and Image Prompts CVPR 2022 Code
ZegFormer: Decoupling Zero-Shot Semantic Segmentation CVPR 2022 Code
LSeg: Language-driven Semantic Segmentation ICLR 2022 Code
ZSSeg: A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model ECCV 2022 Code
OpenSeg: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels ECCV 2022 Code
Fusioner: Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models BMVC 2022 Code
OVSeg: Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP CVPR 2023 Code
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation CVPR 2023 Code
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation CVPR 2023 Code
FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation CVPR 2023 Code
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations CVPR 2023 Code
Exploring Open-Vocabulary Semantic Segmentation without Human Labels arXiv 2023 -
OpenVIS: Open-vocabulary Video Instance Segmentation arXiv 2023 -
Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation arXiv 2023 -
Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation arXiv 2023 Code
Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models arXiv 2023 -