Towards Open Vocabulary Learning: A Survey

arXiv, 2023
Jianzong Wu* · Xiangtai Li* · Shilin Xu* · Haobo Yuan* · Henghui Ding · Yibo Yang · Xia Li · Jiangning Zhang · Yunhai Tong · Xudong Jiang · Bernard Ghanem · Dacheng Tao

This repo is used for recording, tracking, and benchmarking several recent open vocabulary methods to supplement our survey.
If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

🔥Add Your Paper to Our Repo and Survey!

[-] You are welcome to open an issue or PR for your open vocabulary learning work!

[-] Note: Due to the large number of papers on arXiv, we are unable to cover all of them in our survey. You can submit a PR to this repo directly, and we will record it for the next version of our survey.

[-] Our survey will be updated in 2024.3.

🔥New

[-] We updated the GitHub repo to record the papers available as of 2023/7/20.

🔥Highlight!!

[1] The first survey for open vocabulary learning, including open vocabulary detection/segmentation/tracking.

[2] It also contains several related domains, including foundation model tuning and open-world detection.

[3] We list detailed results for the most representative works and give a fairer and clearer comparison of different approaches.

Introduction

This paper presents the first detailed survey of open vocabulary tasks, including open-vocabulary object detection, open-vocabulary segmentation, and 3D/video open-vocabulary tasks.

Methods: A Survey

Keywords

cap.: Use caption as auxiliary training data
vlm.: Use pretrained VLMs like CLIP
pl.: Generate pseudo labels
w/o ps.: Training without pixel-level supervision
pre.: Vision-language pretraining
diff.: Use diffusion models
unify: Unify several tasks (semantic segmentation, instance segmentation, and panoptic segmentation)
sam: Use SAM (Segment Anything Model)
open.: Demonstrated with open-set capability (only for Video Understanding)
audio.: With audio modality
other: Other methods that cannot be grouped into the above categories
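
The keyword tags above can also be used to filter the tables below programmatically. The following is a minimal sketch, assuming each table row is a tab-separated line with the five columns used in this repo (Year, Venue, Keywords, Paper Title, Code/Project). The helper names `parse_keywords`, `parse_row`, and `filter_rows` are hypothetical and are not part of this repo. Tags are compared after stripping backticks, quotes, and the trailing dot, so `vlm.` and `vlm` are treated as the same tag.

```python
import re


def parse_keywords(cell):
    """Return the set of normalized tags found in a Keywords cell.

    Tags are expected to be wrapped in backticks (or single quotes);
    a cell such as "-" yields an empty set.
    """
    raw_tags = re.findall(r"[`']([^`']+)[`']", cell)
    return {tag.strip().lower().rstrip(".") for tag in raw_tags}


def parse_row(line):
    """Parse one tab-separated table row into a dict (assumed 5 columns)."""
    year, venue, keywords, title, link = line.split("\t")
    return {
        "year": int(year),
        "venue": venue,
        "keywords": parse_keywords(keywords),
        "title": title,
        "link": link,
    }


def filter_rows(rows, tag):
    """Keep only the rows that carry the given tag (trailing dot optional)."""
    wanted = tag.strip().lower().rstrip(".")
    return [row for row in rows if wanted in row["keywords"]]


# Three rows copied from the Open Vocabulary Object Detection table.
SAMPLE_LINES = [
    "2021\tCVPR\t`cap.`\tOpen-Vocabulary Object Detection Using Captions\tCode",
    "2022\tCVPR\t`cap.`, `vlm.`, `pre.`\tRegionCLIP: Region-based Language-Image Pretraining\tCode",
    "2022\tECCV\t`vlm.`, `pl.`, `w/o ps.`\tDetecting Twenty-thousand Classes using Image-level Supervision\tCode",
]
ROWS = [parse_row(line) for line in SAMPLE_LINES]

if __name__ == "__main__":
    for row in filter_rows(ROWS, "cap."):
        print(row["year"], row["venue"], row["title"])
```

Normalizing the tags before comparison is a design choice for this sketch: it keeps the filter tolerant of small formatting differences between rows.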

Open Vocabulary Object Detection

Year	Venue	Keywords	Paper Title	Code/Project
2021	CVPR	`cap.`	Open-Vocabulary Object Detection Using Captions	Code
2022	ICLR	`vlm.`	Open-vocabulary Object Detection via Vision and Language Knowledge Distillation	Code
2022	CVPR	`cap.`, `vlm.`, `pre.`	RegionCLIP: Region-based Language-Image Pretraining	Code
2022	CVPR	`vlm.`	Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model	Code
2022	CVPR	`vlm.`, `cap.`	Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation	Code
2022	CVPR	`cap.`, `vlm.`	Grounded Language-Image Pre-training	Code
2022	NeurIPS	`cap.`, `vlm.`	GLIPv2: Unifying Localization and VL Understanding	Code
2022	GCPR	`cap.`	Localized Vision-Language Matching for Open-vocabulary Object Detection	Code
2022	ECCV	`vlm.`	Open-Vocabulary DETR with Conditional Matching	Code
2022	ECCV	`vlm.`, `cap.`, `pl.`	Open Vocabulary Object Detection with Pseudo Bounding-Box Labels	Code
2022	ECCV	`vlm.`	PromptDet: Towards Open-vocabulary Detection using Uncurated Images	Code
2022	ECCV	`vlm.`, `pl.`, `w/o ps.`	Detecting Twenty-thousand Classes using Image-level Supervision	Code
2022	ECCV	`vlm.`, `pl.`	Exploiting unlabeled data with vision and language models for object detection	Code
2022	ECCV	`vlm.`, `cap.`	Simple Open-Vocabulary Object Detection with Vision Transformers	Code
2022	NeurIPS	`vlm.`, `pl.`	Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection	Code
2022	NeurIPS	`vlm.`, `cap.`	DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection	N/A
2022	arXiv	`vlm.`	Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization	Code
2022	arXiv	`vlm.`, `pl.`	P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection	N/A
2023	ICLR	`vlm.`, `pl.`	Learning Object-Language Alignments for Open-Vocabulary Object Detection	Code
2023	ICLR	`vlm.`	F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models	Code
2023	CVPR	`other`, `vlm.`	Learning to Detect and Segment for Open Vocabulary Object Detection	N/A
2023	CVPR	`vlm.`, `cap.`	Aligning Bag of Regions for Open-Vocabulary Object Detection	Code
2023	CVPR	`vlm.`	Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection	Code
2023	CVPR	`vlm.`	CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching	N/A
2023	CVPR	`vlm.`, `pl.`	DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment	N/A
2023	CVPR	`vlm.`	Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers	N/A
2023	ICML	`vlm.`	Multi-Modal Classifiers for Open-Vocabulary Object Detection	Project
2023	arXiv	`vlm.`	GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning	N/A
2023	arXiv	`vlm.`, `cap.`	Enhancing the Role of Context in Region-Word Alignment for Object Detection	N/A
2023	arXiv	`cap.`, `pl.`	Open-Vocabulary Object Detection using Pseudo Caption Labels	N/A
2023	arXiv	`vlm.`, `pl.`	Three ways to improve feature alignment for open vocabulary detection	N/A
2023	arXiv	`vlm.`	Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection	N/A
2023	arXiv	`vlm.`, `cap.`, `pl.`	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks	N/A
2023	arXiv	`vlm.`, `cap.`, `pl.`	Scaling Open-Vocabulary Object Detection	N/A
2023	arXiv	`vlm.`	Open-Vocabulary Object Detection via Scene Graph Discovery	N/A
2023	arXiv	`unify`, `vlm.`, `pre.`	CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction	Code
2024	AAAI	`unify`, `vlm.`, `pre.`	CLIM: Contrastive Language-Image Mosaic for Region Representation	Code

Open Vocabulary Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2023	CVPR	`unify`, `vlm.`	Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation	Code
2023	CVPR	`unify`, `vlm.`	FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation	Code

Semantic Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2022	ICLR	`vlm.`	Language-driven Semantic Segmentation	Code
2022	CVPR	`cap.`, `w/o ps.`	GroupViT: Semantic Segmentation Emerges from Text Supervision	Code
2022	CVPR	`vlm.`	ZegFormer: Decoupling Zero-Shot Semantic Segmentation	Code
2022	ECCV	`cap.`, `vlm.`	Scaling Open-Vocabulary Image Segmentation with Image-Level Labels	N/A
2022	ECCV	`vlm.`, `pl.`, `w/o ps.`	Extract Free Dense Labels from CLIP	Code
2022	ECCV	`vlm.`	A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model	Code
2022	ECCV	`vlm.`, `cap.`, `w/o ps.`	Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding	N/A
2022	BMVC	`vlm.`	Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models	Code
2022	arXiv	`vlm.`, `cap.`, `pl.`, `w/o ps.`	Perceptual Grouping in Contrastive Vision-Language Models	Code
2022	arXiv	`vlm.`, `cap.`, `pl.`, `w/o ps.`	SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation	Code
2022	arXiv	`vlm.`, `cap.`, `w/o ps.`	Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning	N/A
2023	CVPR	`vlm.`, `pre.`	Generalized Decoding for Pixel, Image, and Language	Code
2023	CVPR	`vlm.`, `pl.`	Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP	Code
2023	CVPR	`cap.`, `vlm.`, `w/o ps.`	Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision	Code
2023	CVPR	`vlm.`	Side Adapter Network for Open-Vocabulary Semantic Segmentation	Code
2023	arXiv	`vlm.`, `unify`	A Simple Framework for Open-Vocabulary Segmentation and Detection	Code
2023	arXiv	`vlm.`	Global Knowledge Calibration for Fast Open-Vocabulary Segmentation	N/A
2023	arXiv	`vlm.`	CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation	Code
2023	arXiv	`vlm.`, `unify`	Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition	Code
2023	arXiv	`vlm.`, `unify`	Segment Everything Everywhere All at Once	Code
2023	arXiv	`vlm.`	MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation	N/A
2023	arXiv	`vlm.`	TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation	N/A
2023	arXiv	`vlm.`, `w/o ps.`, `sam`	Exploring Open-Vocabulary Semantic Segmentation without Human Labels	N/A
2023	arXiv	`vlm.`, `unify`	DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model	N/A
2023	arXiv	`diff.`	Diffusion Models for Zero-Shot Open-Vocabulary Segmentation	Project
2023	ICCV	`diff.`	DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models	Project
2023	ICCV	`diff.`	Guiding Text-to-Image Diffusion Model Towards Grounded Generation	Project

Instance Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2023	CVPR	`vlm.`	Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation	Code
2022	CVPR	`cap.`, `pl.`, `vlm.`	Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling	Code
2023	CVPR	`vlm.`, `cap.`, `w/o ps.`	Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations	Code
2023	arXiv	`cap.`	Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation	Code

Panoptic Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2023	CVPR	`unify`, `vlm.`	Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation	Code
2022	arXiv	`vlm.`	Open-Vocabulary Panoptic Segmentation with MaskCLIP	N/A
2023	CVPR	`diff.`, `vlm.`	Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models	Code
2023	arXiv	`vlm.`	Open-vocabulary Panoptic Segmentation with Embedding Modulation	N/A
2023	arXiv	`vlm.`, `unify`	Hierarchical Open-vocabulary Universal Image Segmentation	Code

Open Vocabulary Video Understanding

Video Classification

Year	Venue	Keywords	Paper Title	Code/Project
2021	arXiv	`vlm.`, `open.`	ActionCLIP: A New Paradigm for Video Action Recognition	Code
2022	ECCV	`vlm.`, `open.`	Prompting Visual-Language Models for Efficient Video Understanding	Project
2022	ECCV	`vlm.`	Frozen CLIP Models are Efficient Video Learners	Code
2022	ECCV	`vlm.`, `open.`	Expanding Language-Image Pretrained Models for General Video Recognition	Code
2022	arXiv	`vlm.`, `open.`, `audio.`	Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models	N/A
2023	AAAI	`vlm.`, `open.`	Revisiting Classifier: Transferring Vision-Language Models for Video Recognition	Code
2023	ICLR	`vlm.`	AIM: Adapting Image Models for Efficient Video Action Recognition	Project
2023	CVPR	`vlm.`, `open.`	Fine-tuned CLIP Models are Efficient Video Learners	Code
2023	ICML	`vlm.`, `open.`	Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization	Code
2023	arXiv	`vlm.`, `open.`	Video Action Recognition with Attentive Semantic Units	N/A
2023	arXiv	`vlm.`, `open.`	VicTR: Video-conditioned Text Representations for Activity Recognition	N/A
2023	arXiv	`vlm.`, `open.`	MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge	N/A

Tracking

Year	Venue	Keywords	Paper Title	Code/Project
2023	CVPR	`vlm.`, `open.`	OVTrack: Open-Vocabulary Multiple Object Tracking	Project

Video Instance Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2023	arXiv	`vlm.`, `open.`	Towards Open-Vocabulary Video Instance Segmentation	N/A
2023	arXiv	`vlm.`, `open.`	OpenVIS: Open-vocabulary Video Instance Segmentation	N/A

Open Vocabulary 3D Scene Understanding

3D Classification

Year	Venue	Keywords	Paper Title	Code/Project
2022	CVPR	`vlm.`	PointCLIP: Point Cloud Understanding by CLIP	Code
2022	arXiv	`vlm.`	CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training	Code
2022	arXiv	`vlm.`	PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning	Code
2022	arXiv	`vlm.`	LidarCLIP or: How I Learned to Talk to Point Clouds	Code
2023	CVPR	`vlm.`	ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding	Code
2023	ICML	`vlm.`	Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining	Code

3D Detection

Year	Venue	Keywords	Paper Title	Code/Project
2022	arXiv	`vlm.`	Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning	N/A
2023	CVPR	`vlm.`	Open-Vocabulary Point-Cloud Object Detection without 3D Annotation	Code
2023	NeurIPS	`vlm.`	CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection	Project

3D Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2023	CVPR	`vlm.`	PLA: Language-Driven Open-Vocabulary 3D Scene Understanding	Code
2023	CVPR	`vlm.`	CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP	Code
2023	CVPR	`vlm.`	OpenScene: 3D Scene Understanding with Open Vocabularies	Project
2023	arXiv	`vlm.`	CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP	N/A
2023	arXiv	`vlm.`	OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation	Project
2023	arXiv	`vlm.`	OpenMask3D: Open-Vocabulary 3D Instance Segmentation	Project

Related Domains and Beyond

Class-agnostic Detection and Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2022	RA-L	-	Learning Open-World Object Proposals without Learning to Classify	Code
2021	ICCV	-	Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation	Project
2022	CVPR	-	Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity	Project
2022	ECCV	-	Class-agnostic object detection with multi-modal transformer	Code
2022	TPAMI	-	Open World Entity Segmentation	Project
2022	arXiv	-	Fine-Grained Entity Segmentation	Project

Open-World Object Detection

Year	Venue	Keywords	Paper Title	Code/Project
2015	CVPR	-	Towards Open World Recognition	N/A
2021	CVPR	-	Towards Open World Object Detection	Code
2022	CVPR	-	OW-DETR: Open-world Detection Transformer	Code
2022	ECCV	-	UC-OWOD: Unknown-Classified Open World Object Detection	Code
2022	arXiv	-	Revisiting Open World Object Detection	Code
2022	arXiv	-	Rectifying Open-set Object Detection: A Taxonomy, Practical Applications, and Proper Evaluation	N/A
2022	arXiv	-	Open World DETR: Transformer based Open World Object Detection	N/A
2022	arXiv	-	PROB: Probabilistic Objectness for Open World Object Detection	Code

Open-Set Panoptic Segmentation

Year	Venue	Keywords	Paper Title	Code/Project
2021	CVPR	-	Exemplar-Based Open-Set Panoptic Segmentation Network	Project
2022	arXiv	-	Dual Decision Improves Open-Set Panoptic Segmentation	Code

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{wu2023open,
      title={Towards Open Vocabulary Learning: A Survey}, 
      author={Jianzong Wu and Xiangtai Li and Shilin Xu and Haobo Yuan and Henghui Ding and Yibo Yang and Xia Li and Jiangning Zhang and Yunhai Tong and Xudong Jiang and Bernard Ghanem and Dacheng Tao},
      year={2023},
      journal={arXiv pre-print},
}

Contact

jzwu@stu.pku.edu.cn

xiangtai.li@ntu.edu.sg

lxtpku@pku.edu.cn

jin-s13/Awesome-Open-Vocabulary