
Awesome Described Object Detection

A curated list of papers and resources related to Described Object Detection, Open-Vocabulary/Open-World Object Detection and Referring Expression Comprehension.
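To make the scope concrete, the sketch below contrasts the input/output contracts of the tasks covered here. It is a minimal, hypothetical illustration: the function names and types are ours, not from any paper or codebase listed below.

```python
# Hypothetical task interfaces, for orientation only.
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    label: str    # category name or matched description
    score: float

def detect(image) -> List[Box]:
    """Closed-set object detection (OD): categories are fixed at training time."""
    ...

def detect_open_vocabulary(image, category_names: List[str]) -> List[Box]:
    """Open-vocabulary detection (OVD): category names are given freely at test time."""
    ...

def comprehend_referring_expression(image, expression: str) -> Box:
    """Referring expression comprehension (REC): one expression is assumed to
    refer to exactly one object in the image."""
    ...

def detect_described_objects(image, descriptions: List[str]) -> List[Box]:
    """Described object detection (DOD): each flexible, possibly long description
    may match zero, one, or many objects, so absent descriptions must be rejected."""
    ...
```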

If you find any work or resources missing, please send a pull request. The list is updated frequently. Thanks!



📑 If you find our projects helpful to your research, please consider citing:

@inproceedings{xie2023DOD,
  title={Described Object Detection: Liberating Object Detection with Flexible Expressions},
  author={Xie, Chi and Zhang, Zhao and Wu, Yixuan and Zhu, Feng and Zhao, Rui and Liang, Shuang},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
  year={2023}
}

Awesome Papers

Described Object Detection

  • Aligning and Prompting Everything All at Once for Universal Visual Perception (arxiv 2023) [paper] [code]

  • Described Object Detection: Liberating Object Detection with Flexible Expressions (NeurIPS 2023) [paper] [dataset] [code]

Methods with Potential for DOD

These methods are either MLLMs with detection/localization capabilities or multi-task models that handle both OD/OVD and REC. Although they do not directly address DOD and are not evaluated on DOD benchmarks in their original papers, they may achieve performance comparable to the DOD baseline.
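As a rough sketch of such an adaptation (the `ground` callable and the threshold value are assumptions for illustration, not any specific method's API), a grounding/REC model can be run once per description and its outputs filtered by confidence, so that descriptions with no match in the image return no boxes, as DOD requires:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: given an image and one description, return candidate
# boxes as (x1, y1, x2, y2, score) tuples.
GroundFn = Callable[[object, str], List[Tuple[float, float, float, float, float]]]

def dod_style_predict(
    image,
    descriptions: List[str],
    ground: GroundFn,
    score_threshold: float = 0.3,  # assumed value; would need tuning per model
) -> Dict[str, List[Tuple[float, float, float, float, float]]]:
    """Run a grounding model once per description and keep only confident boxes,
    so that a description absent from the image yields an empty list."""
    results = {}
    for description in descriptions:
        candidates = ground(image, description)
        results[description] = [b for b in candidates if b[4] >= score_threshold]
    return results
```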

  • Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs (arxiv 2023) [paper] [code (soon)]

  • Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models (arxiv 2023) [paper] [code]

  • Ferret: Refer and Ground Anything Anywhere at Any Granularity (arxiv 2023) [paper] [code]

  • Contextual Object Detection with Multimodal Large Language Models (arxiv 2023) [paper] [demo] [code]

  • Kosmos-2: Grounding Multimodal Large Language Models to the World (arxiv 2023) [paper] [demo] [code]

  • Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (arxiv 2023) [paper] [demo] [code]

  • Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic (arxiv 2023) [paper] [demo] [code]

  • Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (arxiv 2023) [paper] [code (eval)] (REC, OD, etc.)

  • Universal Instance Perception as Object Discovery and Retrieval (CVPR 2023) [paper] [code] (REC, OVD, etc.)

  • Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (NeurIPS 2022) [paper] [code]

  • FindIt: Generalized Localization with Natural Language Queries (ECCV 2022) [paper] [code] (REC, OD, etc.)

  • GRiT: A Generative Region-to-text Transformer for Object Understanding (arxiv 2022) [paper] [demo (colab)] [code]

Open-Vocabulary Object Detection

  • CLIM: Contrastive Language-Image Mosaic for Region Representation (AAAI 2024) [paper] [code]

  • Simple Image-level Classification Improves Open-vocabulary Object Detection (arxiv 2023) [paper] [code]

  • ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection (AAAI 2024) [paper]

  • OpenSD: Unified Open-Vocabulary Segmentation and Detection (arxiv 2023) [paper] [code (soon)]

  • Boosting Segment Anything Model Towards Open-Vocabulary Learning (arxiv 2023) [paper]

  • Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection (arxiv 2023) [paper]

  • Language-conditioned Detection Transformer (arxiv 2023) [paper] [code]

  • The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding (arxiv 2023) [paper] [code]

  • LP-OVOD: Open-Vocabulary Object Detection by Linear Probing (WACV 2024) [paper] [code (soon)]

  • Meta-Adapter: An Online Few-shot Learner for Vision-Language Model (NeurIPS 2023) [paper]

  • Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization (BMVC 2023) [paper]

  • CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection (NeurIPS 2023) [paper] [code]

  • DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection (arxiv 2023) [paper] [code (soon)]

  • Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection (arxiv 2023) [paper]

  • Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection (arxiv 2023) [paper]

  • How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (arxiv 2023) [paper] [dataset]

  • Improving Pseudo Labels for Open-Vocabulary Object Detection (arxiv 2023) [paper]

  • Scaling Open-Vocabulary Object Detection (arxiv 2023) [paper] [code (jax)]

  • Unified Open-Vocabulary Dense Visual Prediction (arxiv 2023) [paper]

  • TIB: Detecting Unknown Objects Via Two-Stream Information Bottleneck (TPAMI 2023) [paper]

  • Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection (TNNLS 2023) [paper]

  • Open-Vocabulary Object Detection via Scene Graph Discovery (ACM MM 2023) [paper]

  • Three Ways to Improve Feature Alignment for Open Vocabulary Detection (arxiv 2023) [paper]

  • Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection (arxiv 2023) [paper]

  • Open-Vocabulary Object Detection using Pseudo Caption Labels (arxiv 2023) [paper]

  • What Makes Good Open-Vocabulary Detector: A Disassembling Perspective (KDD 2023 Workshop) [paper]

  • Open-Vocabulary Object Detection With an Open Corpus (ICCV 2023) [paper]

  • Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection (ICCV 2023) [paper] [code]

  • A Simple Framework for Open-Vocabulary Segmentation and Detection (ICCV 2023) [paper] [code]

  • EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment (ICCV 2023) [paper] [website]

  • Contrastive Feature Masking Open-Vocabulary Vision Transformer (ICCV 2023) [paper]

  • Multi-Modal Classifiers for Open-Vocabulary Object Detection (ICML 2023) [paper] [code (eval)]

  • CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching (CVPR 2023) [paper] [code]

  • Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection (CVPR 2023) [paper] [code]

  • Aligning Bag of Regions for Open-Vocabulary Object Detection (CVPR 2023) [paper] [code]

  • Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers (CVPR 2023) [paper] [code]

  • DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment (CVPR 2023) [paper]

  • Learning to Detect and Segment for Open Vocabulary Object Detection (CVPR 2023) [paper]

  • F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models (ICLR 2023) [paper] [code] [website]

  • Learning Object-Language Alignments for Open-Vocabulary Object Detection (ICLR 2023) [paper] [code]

  • Simple Open-Vocabulary Object Detection with Vision Transformers (ECCV 2022) [paper] [code (jax)] [code (huggingface)]

  • Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization (arxiv 2022) [paper] [code]

  • Localized Vision-Language Matching for Open-vocabulary Object Detection (GCPR 2022) [paper] [code]

  • Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection (NeurIPS 2022) [paper] [code]

  • X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks (ECCV 2022) [paper]

  • Exploiting Unlabeled Data with Vision and Language Models for Object Detection (ECCV 2022) [paper] [code]

  • PromptDet: Towards Open-vocabulary Detection using Uncurated Images (ECCV 2022) [paper] [website] [code]

  • Open-Vocabulary DETR with Conditional Matching (ECCV 2022) [paper] [code]

  • Open Vocabulary Object Detection with Pseudo Bounding-Box Labels (ECCV 2022) [paper] [code]

  • RegionCLIP: Region-Based Language-Image Pretraining (CVPR 2022) [paper] [code]

  • Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling (CVPR 2022) [paper] [code]

  • Open-Vocabulary One-Stage Detection With Hierarchical Visual-Language Knowledge Distillation (CVPR 2022) [paper] [code]

  • Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model (CVPR 2022) [paper] [code]

  • Open-vocabulary Object Detection via Vision and Language Knowledge Distillation (ICLR 2022) [paper] [code]

  • Open-Vocabulary Object Detection Using Captions (CVPR 2021) [paper] [code]

Referring Expression Comprehension/Visual Grounding

  • GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (arxiv 2023) [paper] [code]

  • Context Disentangling and Prototype Inheriting for Robust Visual Grounding (TPAMI 2023) [paper] [code]

  • Cycle-Consistency Learning for Captioning and Grounding (AAAI 2024) [paper]

  • Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions (arxiv 2023) [paper]

  • Continual Referring Expression Comprehension via Dual Modular Memorization (arxiv 2023) [paper] [code]

  • ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability (arxiv 2023) [paper]

  • OV-VG: A Benchmark for Open-Vocabulary Visual Grounding (arxiv 2023) [paper] [code]

  • VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders (arxiv 2023) [paper]

  • Language-Guided Diffusion Model for Visual Grounding (arxiv 2023) [paper] [code (soon)]

  • Fine-Grained Visual Prompting (arxiv 2023) [paper]

  • ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities (arxiv 2023) [paper] [code]

  • CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding (TMM 2023) [paper] [code]

  • Unleashing Text-to-Image Diffusion Models for Visual Perception (ICCV 2023) [paper] [website] [code]

  • Focusing On Targets For Improving Weakly Supervised Visual Grounding (ICASSP 2023) [paper]

  • Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (ICLR 2023) [paper] [code (eval)]

  • PolyFormer: Referring Image Segmentation as Sequential Polygon Generation (CVPR 2023) [paper] [website] [code] [demo]

  • Advancing Visual Grounding With Scene Knowledge: Benchmark and Method (CVPR 2023) [paper] [code]

  • Language Adaptive Weight Generation for Multi-task Visual Grounding (CVPR 2023) [paper]

  • From Coarse to Fine-grained Concept based Discrimination for Phrase Detection (CVPR 2023 Workshop) [paper]

  • Referring Expression Comprehension Using Language Adaptive Inference (AAAI 2023) [paper]

  • DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding (AAAI 2023) [paper] [code]

  • One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning (arxiv 2022) [paper]

  • Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension (arxiv 2022) [paper]

  • SeqTR: A Simple yet Universal Network for Visual Grounding (ECCV 2022) [paper] [code]

  • SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding (ECCV 2022) [paper]

  • Towards Unifying Reference Expression Generation and Comprehension (EMNLP 2022) [paper]

  • Correspondence Matters for Video Referring Expression Comprehension (ACM MM 2022) [paper]

  • Visual Grounding with Transformers (ICME 2022) [paper] [code]

  • Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning (CVPR 2022) [paper] [code]

  • Multi-Modal Dynamic Graph Transformer for Visual Grounding (CVPR 2022) [paper] [code]

  • Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding (CVPR 2022) [paper] [code]

  • OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (ICML 2022) [paper] [code]

  • Towards Language-guided Visual Recognition via Dynamic Convolutions (arxiv 2021) [paper]

  • Referring Transformer: A One-step Approach to Multi-task Visual Grounding (NeurIPS 2021) [paper] [code]

  • InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring (ICCV 2021) [paper] [code]

  • MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding (ICCV 2021) [paper] [website] [code]

  • Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding (CVPR 2021) [paper] [code]

  • Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos (CVPR 2021) [paper] [code]

  • Relation-aware Instance Refinement for Weakly Supervised Visual Grounding (CVPR 2021) [paper] [code]

  • Large-Scale Adversarial Training for Vision-and-Language Representation Learning (NeurIPS 2020) [paper] [code] [poster]

  • Improving One-stage Visual Grounding by Recursive Sub-query Construction (ECCV 2020) [paper] [code]

  • UNITER: UNiversal Image-TExt Representation Learning (ECCV 2020) [paper] [code]

  • Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation (CVPR 2020) [paper] [code]

  • A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension (CVPR 2020) [paper]

  • Dynamic Graph Attention for Referring Expression Comprehension (ICCV 2019) [paper]

  • A Fast and Accurate One-Stage Approach to Visual Grounding (ICCV 2019) [paper] [code]

  • Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks (CVPR 2019) [paper]

  • Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction (RSS 2018) [paper] [code]

  • Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding (IJCAI 2018) [paper] [code]

  • MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018) [paper] [code]

  • Comprehension-Guided Referring Expressions (CVPR 2017) [paper]

  • Modeling Context Between Objects for Referring Expression Understanding (ECCV 2016) [paper]

Awesome Datasets

This part is still in progress.

Datasets for DOD and Similar Tasks

| Name | Paper | Website | Code | Train/Eval | Notes |
|------|-------|---------|------|------------|-------|
| $D^3$ | Described Object Detection: Liberating Object Detection with Flexible Expressions (NeurIPS 2023) | - | Github | eval only | - |

Detection Datasets

| Name | Paper | Task | Website | Code | Train/Eval | Notes |
|------|-------|------|---------|------|------------|-------|
| Bamboo | Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy | OD | - | Github | detector pretraining | built upon public datasets; 69M image classification annotations and 32M object bounding boxes |
| BigDetection | BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training (CVPR 2022 workshop) | OD | - | Github | detector pretraining | - |
| Object365 | Objects365: A Large-Scale, High-Quality Dataset for Object Detection (ICCV 2019) | OD | Link | BAAI platform for download | detector pretraining; train & eval | - |
| OpenImages | - | OD | Link | Tensorflow API | train & eval | - |
| LVIS | LVIS: A Dataset for Large Vocabulary Instance Segmentation (CVPR 2019) | OD & OVD | Link | Github | train & eval | long-tail; federated annotation; also used for OVD |
| COCO | Microsoft COCO: Common Objects in Context (ECCV 2014) | OD & OVD | Link | Github | train & eval | also used for OVD |
| VOC | The PASCAL Visual Object Classes (VOC) Challenge (IJCV 2010) | OD | Link | - | train & eval | - |

Grounding Datasets

| Name | Paper | Task | Website | Code | Train/Eval | Notes |
|------|-------|------|---------|------|------------|-------|
| GRIT (Ground-and-Refer Instruction-Tuning) | Ferret: Refer and Ground Anything Anywhere at Any Granularity (arxiv 2023) | ground-and-refer | - | Github | instruction tuning | 1.1M samples |
| Ferret-Bench | Ferret: Refer and Ground Anything Anywhere at Any Granularity (arxiv 2023) | ground-and-refer | - | Github | eval only | - |
| GRIT (Grounded Image-Text) | Kosmos-2: Grounding Multimodal Large Language Models to the World (arxiv 2023) | visual grounding (REC & Phrase Grounding) | - | Github, Huggingface | train only | created based on image-text pairs from a subset of COYO-700M and LAION-2B; 20.5M |
| SK-VG | Advancing Visual Grounding With Scene Knowledge: Benchmark and Method (CVPR 2023) | REC | - | Github | train & eval | scene knowledge in natural language is required |
| GRiT (General Robust Image Task) | GRIT: General Robust Image Task Benchmark (arxiv 2022) | REC | Link | Github | eval only | - |
| Cops-Ref | Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension (CVPR 2020) | Compositional REC | - | Github | eval only | a variant of REC |
| Visual Genome | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations (IJCV 2017) | OD & Phrase Grounding | Link | Github | - | multiple multi-modal tasks (including REC) |
| RefCOCOg | Generation and Comprehension of Unambiguous Object Descriptions (CVPR 2016) | REC | - | Github | train & eval | images from COCO |
| RefClef | ReferItGame: Referring to Objects in Photographs of Natural Scenes (EMNLP 2014) | REC | - | Github | train & eval | - |
| RefCOCO+ | ReferItGame: Referring to Objects in Photographs of Natural Scenes (EMNLP 2014) | REC | - | Github | train & eval | images from COCO |
| RefCOCO | ReferItGame: Referring to Objects in Photographs of Natural Scenes (EMNLP 2014) | REC | - | Github | train & eval | images from COCO |
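Viewed uniformly, most of the grounding datasets above reduce to (image, expression, box) annotations. The dataclass below is a hypothetical common schema such entries could be normalized into; the field names are illustrative and do not match any dataset's actual annotation format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GroundingSample:
    """One annotation: an image, a natural-language expression, and the box(es)
    it refers to. REC datasets have exactly one box per expression."""
    image_path: str
    expression: str
    boxes: List[Tuple[float, float, float, float]]  # (x1, y1, x2, y2)
    split: str = "train"           # "train" / "val" / "test"
    source: Optional[str] = None   # e.g. "RefCOCO", "Cops-Ref"
```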

Related Surveys and Resources

Some survey papers on relevant tasks (open-vocabulary learning, etc.):

  • Towards Open Vocabulary Learning: A Survey (arxiv 2023) [paper] [repo]
  • A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future (arxiv 2023) [paper]
  • Referring Expression Comprehension: A Survey of Methods and Datasets (TMM 2020) [paper]

Some similar GitHub repos (awesome lists):

Acknowledgement

The structure and format of this repo are inspired by BradyFU/Awesome-Multimodal-Large-Language-Models.