Towards Open Vocabulary Learning: A Survey

arXiv, 2023
Jianzong Wu * · Xiangtai Li * · Shilin Xu * · Haobo Yuan * · Henghui Ding · Yibo Yang · Xia Li · Jiangning Zhang · Yunhai Tong · Xudong Jiang · Bernard Ghanem · Dacheng Tao

arXiv PDF


This repo records, tracks, and benchmarks recent open vocabulary methods to supplement our survey.
If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

🔥Add Your Paper to Our Repo and Survey!

[-] You are welcome to open an issue or PR for your open vocabulary learning work!

[-] Note: Given the large number of papers on arXiv, we regret that we cannot cover all of them in our survey. You can submit a PR to this repo directly, and we will record it for the next version of our survey.

[-] Our survey will be updated in March 2024.

🔥New

[-] We have updated this repo to record the papers available as of 2023/7/20.

🔥Highlight!!

[1] The first survey on open vocabulary learning, covering open vocabulary detection, segmentation, and tracking.

[2] It also contains several related domains, including foundation model tuning and open-world detection.

[3] We list detailed results for the most representative works and give a fairer and clearer comparison of different approaches.

Introduction

This paper presents the first detailed survey of open vocabulary tasks, including open-vocabulary object detection, open-vocabulary segmentation, and 3D/video open-vocabulary tasks.

Summary of Contents

Methods: A Survey

Keywords

  • cap.: Use caption as auxiliary training data
  • vlm.: Use pretrained VLMs like CLIP
  • pl.: Generate pseudo labels
  • w/o ps.: Training without pixel-level supervision
  • pre.: Vision-language pretraining
  • diff.: Use diffusion models
  • unify: Unify several tasks (semantic segmentation, instance segmentation, and panoptic segmentation)
  • sam: Use SAM (Segment Anything Model)
  • open.: Demonstrates open-set capability (Video Understanding only)
  • audio.: Uses the audio modality
  • other: Other methods that do not fit the categories above
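
The Keywords column in the tables combines these tags with commas. As a minimal sketch (not part of the survey or of this repo's tooling), the legend above could be used to expand a row's Keywords cell into readable descriptions. The names `KEYWORDS`, `normalize`, and `expand_keywords` are hypothetical, the descriptions are copied or lightly shortened from the legend, and the normalization rule (ignore case, surrounding quotes, and trailing dots) is an assumption made so that small spelling variations of a tag resolve to the same entry.

```python
# Hypothetical helper: expands a Keywords cell into legend descriptions.
# The mapping mirrors the legend above; it is not an official schema.
KEYWORDS = {
    "cap": "Use caption as auxiliary training data",
    "vlm": "Use pretrained VLMs like CLIP",
    "pl": "Generate pseudo labels",
    "w/o ps": "Training without pixel-level supervision",
    "pre": "Vision-language pretraining",
    "diff": "Use diffusion models",
    "unify": "Unify several segmentation tasks",
    "sam": "Use SAM (Segment Anything Model)",
    "open": "Open-set capability (Video Understanding only)",
    "audio": "With audio modality",
    "other": "Other methods",
}


def normalize(tag: str) -> str:
    """Lowercase a tag and drop surrounding spaces, quotes, and trailing dots."""
    return tag.strip().strip("'\"").rstrip(".").strip().lower()


def expand_keywords(cell: str) -> list[str]:
    """Split a comma-separated Keywords cell and look up each tag."""
    tags = [normalize(t) for t in cell.split(",")]
    # Empty fragments are skipped; unrecognized tags are reported, not dropped.
    return [KEYWORDS.get(t, f"unknown tag: {t}") for t in tags if t]


if __name__ == "__main__":
    for line in expand_keywords("vlm., pl., w/o ps."):
        print(line)
```

Unknown tags are reported rather than silently dropped, so a placeholder such as `-` in the Keywords column shows up as `unknown tag: -` instead of disappearing.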

Open Vocabulary Object Detection

Year Venue Keywords Paper Title Code/Project
2021 CVPR cap. Open-Vocabulary Object Detection Using Captions Code
2022 ICLR vlm. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation Code
2022 CVPR cap., vlm., pre. RegionCLIP: Region-based Language-Image Pretraining Code
2022 CVPR vlm. Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model Code
2022 CVPR vlm., cap. Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation Code
2022 CVPR cap., vlm. Grounded Language-Image Pre-training Code
2022 NeurIPS cap., vlm. GLIPv2: Unifying Localization and VL Understanding Code
2022 GCPR cap. Localized Vision-Language Matching for Open-vocabulary Object Detection Code
2022 ECCV vlm. Open-Vocabulary DETR with Conditional Matching Code
2022 ECCV vlm., cap., pl. Open Vocabulary Object Detection with Pseudo Bounding-Box Labels Code
2022 ECCV vlm. PromptDet: Towards Open-vocabulary Detection using Uncurated Images Code
2022 ECCV vlm., pl., w/o ps. Detecting Twenty-thousand Classes using Image-level Supervision Code
2022 ECCV vlm., pl. Exploiting Unlabeled Data with Vision and Language Models for Object Detection Code
2022 ECCV vlm., cap. Simple Open-Vocabulary Object Detection with Vision Transformers Code
2022 NeurIPS vlm., pl. Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection Code
2022 NeurIPS vlm., cap. DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection N/A
2022 arXiv vlm. Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization Code
2022 arXiv vlm., pl. P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection N/A
2023 ICLR vlm., pl. Learning Object-Language Alignments for Open-Vocabulary Object Detection Code
2023 ICLR vlm. F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models Code
2023 CVPR other, vlm. Learning to Detect and Segment for Open Vocabulary Object Detection N/A
2023 CVPR vlm., cap. Aligning Bag of Regions for Open-Vocabulary Object Detection Code
2023 CVPR vlm. Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection Code
2023 CVPR vlm. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching N/A
2023 CVPR vlm., pl. DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment N/A
2023 CVPR vlm. Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers N/A
2023 ICML vlm. Multi-Modal Classifiers for Open-Vocabulary Object Detection Project
2023 arXiv vlm. GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning N/A
2023 arXiv vlm., cap. Enhancing the Role of Context in Region-Word Alignment for Object Detection N/A
2023 arXiv cap., pl. Open-Vocabulary Object Detection using Pseudo Caption Labels N/A
2023 arXiv vlm., pl. Three ways to improve feature alignment for open vocabulary detection N/A
2023 arXiv vlm. Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection N/A
2023 arXiv vlm., cap., pl. MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks N/A
2023 arXiv vlm., cap., pl. Scaling Open-Vocabulary Object Detection N/A
2023 arXiv vlm. Open-Vocabulary Object Detection via Scene Graph Discovery N/A
2023 arXiv unify, vlm., pre. CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction Code
2024 AAAI unify, vlm., pre. CLIM: Contrastive Language-Image Mosaic for Region Representation Code

Open Vocabulary Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR unify, vlm. Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation Code
2023 CVPR unify, vlm. FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation Code

Semantic Segmentation

Year Venue Keywords Paper Title Code/Project
2022 ICLR vlm. Language-driven Semantic Segmentation Code
2022 CVPR cap., w/o ps. GroupViT: Semantic Segmentation Emerges from Text Supervision Code
2022 CVPR vlm. ZegFormer: Decoupling Zero-Shot Semantic Segmentation Code
2022 ECCV cap., vlm. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels N/A
2022 ECCV vlm., pl., w/o ps. Extract Free Dense Labels from CLIP Code
2022 ECCV vlm. A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model Code
2022 ECCV vlm., cap., w/o ps. Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding N/A
2022 BMVC vlm. Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models Code
2022 arXiv vlm., cap., pl., w/o ps. Perceptual Grouping in Contrastive Vision-Language Models Code
2022 arXiv vlm., cap., pl., w/o ps. SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation Code
2022 arXiv vlm., cap., w/o ps. Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning N/A
2023 CVPR vlm., pre. Generalized Decoding for Pixel, Image, and Language Code
2023 CVPR vlm., pl. Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP Code
2023 CVPR cap., vlm., w/o ps. Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision Code
2023 CVPR vlm. Side Adapter Network for Open-Vocabulary Semantic Segmentation Code
2023 arXiv vlm., unify A Simple Framework for Open-Vocabulary Segmentation and Detection Code
2023 arXiv vlm. Global Knowledge Calibration for Fast Open-Vocabulary Segmentation N/A
2023 arXiv vlm. CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation Code
2023 arXiv vlm., unify Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition Code
2023 arXiv vlm., unify Segment Everything Everywhere All at Once Code
2023 arXiv vlm. MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation N/A
2023 arXiv vlm. TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation N/A
2023 arXiv vlm., w/o ps., sam Exploring Open-Vocabulary Semantic Segmentation without Human Labels N/A
2023 arXiv vlm., unify DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model N/A
2023 arXiv diff. Diffusion Models for Zero-Shot Open-Vocabulary Segmentation Project
2023 ICCV diff. DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models Project
2023 ICCV diff. Guiding Text-to-Image Diffusion Model Towards Grounded Generation Project

Instance Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR vlm. Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation Code
2022 CVPR cap., pl., vlm. Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling Code
2023 CVPR vlm., cap., w/o ps. Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations Code
2023 arXiv cap. Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation Code

Panoptic Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR unify, vlm. Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation Code
2022 arXiv vlm. Open-Vocabulary Panoptic Segmentation with MaskCLIP N/A
2023 CVPR diff., vlm. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models Code
2023 arXiv vlm. Open-vocabulary Panoptic Segmentation with Embedding Modulation N/A
2023 arXiv vlm., unify Hierarchical Open-vocabulary Universal Image Segmentation Code

Open Vocabulary Video Understanding

Video Classification

Year Venue Keywords Paper Title Code/Project
2021 arXiv vlm., open. ActionCLIP: A New Paradigm for Video Action Recognition Code
2022 ECCV vlm., open. Prompting Visual-Language Models for Efficient Video Understanding Project
2022 ECCV vlm. Frozen CLIP Models are Efficient Video Learners Code
2022 ECCV vlm., open. Expanding Language-Image Pretrained Models for General Video Recognition Code
2022 arXiv vlm., open., audio. Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models N/A
2023 AAAI vlm., open. Revisiting Classifier: Transferring Vision-Language Models for Video Recognition Code
2023 ICLR vlm. AIM: Adapting Image Models for Efficient Video Action Recognition Project
2023 CVPR vlm., open. Fine-tuned CLIP Models are Efficient Video Learners Code
2023 ICML vlm., open. Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization Code
2023 arXiv vlm., open. Video Action Recognition with Attentive Semantic Units N/A
2023 arXiv vlm., open. VicTR: Video-conditioned Text Representations for Activity Recognition N/A
2023 arXiv vlm., open. MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge N/A

Tracking

Year Venue Keywords Paper Title Code/Project
2023 CVPR vlm., open. OVTrack: Open-Vocabulary Multiple Object Tracking Project

Video Instance Segmentation

Year Venue Keywords Paper Title Code/Project
2023 arXiv vlm., open. Towards Open-Vocabulary Video Instance Segmentation N/A
2023 arXiv vlm., open. OpenVIS: Open-vocabulary Video Instance Segmentation N/A

Open Vocabulary 3D Scene Understanding

3D Classification

Year Venue Keywords Paper Title Code/Project
2022 CVPR vlm. PointCLIP: Point Cloud Understanding by CLIP Code
2022 arXiv vlm. CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training Code
2022 arXiv vlm. PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning Code
2022 arXiv vlm. LidarCLIP or: How I Learned to Talk to Point Clouds Code
2023 CVPR vlm. ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding Code
2023 ICML vlm. Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining Code

3D Detection

Year Venue Keywords Paper Title Code/Project
2022 arXiv vlm. Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning N/A
2023 CVPR vlm. Open-Vocabulary Point-Cloud Object Detection without 3D Annotation Code
2023 NeurIPS vlm. CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection Project

3D Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR vlm. PLA: Language-Driven Open-Vocabulary 3D Scene Understanding Code
2023 CVPR vlm. CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP Code
2023 CVPR vlm. OpenScene: 3D Scene Understanding with Open Vocabularies Project
2023 arXiv vlm. CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP N/A
2023 arXiv vlm. OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation Project
2023 arXiv vlm. OpenMask3D: Open-Vocabulary 3D Instance Segmentation Project

Related Domains and Beyond

Class-agnostic Detection and Segmentation

Year Venue Keywords Paper Title Code/Project
2022 RA-L - Learning Open-World Object Proposals without Learning to Classify Code
2021 ICCV - Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation Project
2022 CVPR - Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity Project
2022 ECCV - Class-agnostic Object Detection with Multi-modal Transformer Code
2022 TPAMI - Open World Entity Segmentation Project
2022 arXiv - Fine-Grained Entity Segmentation Project

Open-World Object Detection

Year Venue Keywords Paper Title Code/Project
2015 CVPR - Towards Open World Recognition N/A
2021 CVPR - Towards Open World Object Detection Code
2022 CVPR - OW-DETR: Open-world Detection Transformer Code
2022 ECCV - UC-OWOD: Unknown-Classified Open World Object Detection Code
2022 arXiv - Revisiting Open World Object Detection Code
2022 arXiv - Rectifying Open-set Object Detection: A Taxonomy, Practical Applications, and Proper Evaluation N/A
2022 arXiv - Open World DETR: Transformer based Open World Object Detection N/A
2022 arXiv - PROB: Probabilistic Objectness for Open World Object Detection Code

Open-Set Panoptic Segmentation

Year Venue Keywords Paper Title Code/Project
2021 CVPR - Exemplar-Based Open-Set Panoptic Segmentation Network Project
2022 arXiv - Dual Decision Improves Open-Set Panoptic Segmentation Code

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{wu2023open,
      title={Towards Open Vocabulary Learning: A Survey}, 
      author={Jianzong Wu and Xiangtai Li and Shilin Xu and Haobo Yuan and Henghui Ding and Yibo Yang and Xia Li and Jiangning Zhang and Yunhai Tong and Xudong Jiang and Bernard Ghanem and Dacheng Tao},
      year={2023},
      journal={arXiv pre-print},
}

Contact

jzwu@stu.pku.edu.cn
xiangtai.li@ntu.edu.sg 
lxtpku@pku.edu.cn