Towards Open Vocabulary Learning: A Survey
arXiv, 2023
Jianzong Wu * · Xiangtai Li * · Shilin Xu * · Haobo Yuan * · Henghui Ding · Yibo Yang · Xia Li · Jiangning Zhang · Yunhai Tong · Xudong Jiang · Bernard Ghanem · Dacheng Tao
This repo records, tracks, and benchmarks recent open vocabulary methods as a supplement to our survey.
If you find any missing work or have suggestions (papers, implementations, or other resources), feel free to open a pull request.
We will add the missing papers to this repo ASAP.
[1] The first survey for open vocabulary learning, including open vocabulary detection/segmentation/tracking.
[2] It also contains several related domains, including foundation model tuning and open-world detection.
[3] We list detailed results for the most representative works.
In this work, we present the first detailed survey of open vocabulary tasks, including open vocabulary object detection, open vocabulary segmentation, and 3D/video open vocabulary tasks.
Keywords
cap.: Use captions as auxiliary training data
vlm.: Use pretrained VLMs such as CLIP
pl.: Generate pseudo labels
w/o ps.: Train without pixel-level supervision
pre.: Vision-language pretraining
diff.: Use diffusion models
unify: Unify several tasks (semantic segmentation, instance segmentation, and panoptic segmentation)
sam: Use SAM (Segment Anything Model)
open.: Demonstrates open-set capability (Video Understanding only)
audio.: Uses the audio modality
other: Other methods that do not fit the categories above
Open Vocabulary Object Detection
| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2021 | CVPR | cap. | Open-Vocabulary Object Detection Using Captions | Code |
| 2021 | arXiv | cap., vlm., pre. | RegionCLIP: Region-based Language-Image Pretraining | Code |
| 2022 | CVPR | vlm. | Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model | Code |
| 2022 | ICLR | vlm. | Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | Code |
| 2022 | GCPR | cap. | Localized Vision-Language Matching for Open-vocabulary Object Detection | Code |
| 2022 | ECCV | vlm. | Open-Vocabulary DETR with Conditional Matching | Code |
| 2022 | ECCV | vlm., cap., pl. | Open Vocabulary Object Detection with Pseudo Bounding-Box Labels | Code |
| 2022 | ECCV | vlm. | PromptDet: Towards open-vocabulary detection using uncurated images | Code |
| 2022 | ECCV | vlm., pl., w/o ps. | Detecting Twenty-thousand Classes using Image-level Supervision | Code |
| 2022 | ECCV | vlm., pl. | Exploiting unlabeled data with vision and language models for object detection | Code |
| 2022 | ECCV | vlm., cap. | Simple Open-Vocabulary Object Detection with Vision Transformers | Code |
| 2022 | NeurIPS | vlm., pl. | Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection | Code |
| 2022 | NeurIPS | vlm., cap. | DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | N/A |
| 2022 | arXiv | vlm., cap. | Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation | Code |
| 2022 | arXiv | vlm. | Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization | Code |
| 2022 | arXiv | vlm., pl. | P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection | N/A |
| 2022 | arXiv | vlm., pl. | Learning Object-Language Alignments for Open-Vocabulary Object Detection | Code |
| 2023 | ICLR | vlm. | F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models | Code |
| 2023 | CVPR | other, vlm. | Learning to Detect and Segment for Open Vocabulary Object Detection | N/A |
| 2023 | CVPR | vlm., cap. | Aligning Bag of Regions for Open-Vocabulary Object Detection | Code |
| 2023 | CVPR | vlm. | Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection | Code |
| 2023 | CVPR | vlm. | CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching | N/A |
| 2023 | CVPR | vlm., pl. | DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | N/A |
| 2023 | CVPR | vlm. | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | N/A |
| 2023 | ICML | vlm. | Multi-Modal Classifiers for Open-Vocabulary Object Detection | Project |
| 2023 | arXiv | vlm., cap. | Enhancing the Role of Context in Region-Word Alignment for Object Detection | N/A |
| 2023 | arXiv | cap., pl. | Open-Vocabulary Object Detection using Pseudo Caption Labels | N/A |
| 2023 | arXiv | vlm., pl. | Three ways to improve feature alignment for open vocabulary detection | N/A |
| 2023 | arXiv | vlm. | Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection | N/A |
| 2023 | arXiv | vlm., cap., pl. | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | N/A |
| 2023 | arXiv | vlm., cap., pl. | Scaling Open-Vocabulary Object Detection | N/A |
Open Vocabulary Segmentation
| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2022 | ICLR | vlm. | Language-driven Semantic Segmentation | Code |
| 2022 | CVPR | cap., w/o ps. | GroupViT: Semantic Segmentation Emerges from Text Supervision | Code |
| 2022 | CVPR | vlm. | ZegFormer: Decoupling Zero-Shot Semantic Segmentation | Code |
| 2022 | ECCV | cap., vlm. | Scaling Open-Vocabulary Image Segmentation with Image-Level Labels | N/A |
| 2022 | ECCV | vlm., pl., w/o ps. | Extract Free Dense Labels from CLIP | Code |
| 2022 | ECCV | vlm. | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model | Code |
| 2022 | ECCV | vlm., cap., w/o ps. | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding | N/A |
| 2022 | BMVC | vlm. | Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models | Code |
| 2022 | arXiv | vlm., cap., pl., w/o ps. | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | Code |
| 2022 | arXiv | vlm., cap., w/o ps. | Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning | N/A |
| 2023 | CVPR | vlm., pre. | Generalized Decoding for Pixel, Image, and Language | Code |
| 2023 | CVPR | vlm., pl. | Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | Code |
| 2023 | CVPR | cap., vlm., w/o ps. | Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision | Code |
| 2023 | CVPR | vlm. | Side Adapter Network for Open-Vocabulary Semantic Segmentation | Code |
| 2023 | arXiv | vlm., unify | A Simple Framework for Open-Vocabulary Segmentation and Detection | Code |
| 2023 | arXiv | vlm. | Global Knowledge Calibration for Fast Open-Vocabulary Segmentation | N/A |
| 2023 | arXiv | vlm. | CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation | Code |
| 2023 | arXiv | vlm., unify | Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition | Code |
| 2023 | arXiv | vlm., unify | Segment Everything Everywhere All at Once | Code |
| 2023 | arXiv | vlm. | MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation | N/A |
| 2023 | arXiv | vlm. | TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation | N/A |
| 2023 | arXiv | vlm., w/o ps., sam | Exploring Open-Vocabulary Semantic Segmentation without Human Labels | N/A |
| 2023 | arXiv | vlm., unify | DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model | N/A |
| 2023 | arXiv | diff. | Diffusion Models for Zero-Shot Open-Vocabulary Segmentation | Project |
Open Vocabulary Video Understanding
| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2021 | arXiv | vlm., open. | ActionCLIP: A New Paradigm for Video Action Recognition | Code |
| 2022 | ECCV | vlm., open. | Prompting Visual-Language Models for Efficient Video Understanding | Project |
| 2022 | ECCV | vlm. | Frozen CLIP Models are Efficient Video Learners | Code |
| 2022 | ECCV | vlm., open. | Expanding Language-Image Pretrained Models for General Video Recognition | Code |
| 2022 | arXiv | vlm., open., audio. | Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models | N/A |
| 2023 | AAAI | vlm., open. | Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | Code |
| 2023 | ICLR | vlm. | AIM: Adapting Image Models for Efficient Video Action Recognition | Project |
| 2023 | CVPR | vlm., open. | Fine-tuned CLIP Models are Efficient Video Learners | Code |
| 2023 | ICML | vlm., open. | Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization | Code |
| 2023 | arXiv | vlm., open. | Video Action Recognition with Attentive Semantic Units | N/A |
| 2023 | arXiv | vlm., open. | VicTR: Video-conditioned Text Representations for Activity Recognition | N/A |
| 2023 | arXiv | vlm., open. | MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge | N/A |
Video Instance Segmentation
Open Vocabulary 3D Scene Understanding
Open Vocabulary Relation Detection
Related Domains and Beyond
Class-agnostic Detection and Segmentation
Open-World Object Detection
Open-Set Panoptic Segmentation