
A collection of 3D vision and language (e.g., 3D Visual Grounding, 3D Question Answering and 3D Dense Caption) papers and datasets.

MIT License

Awesome-3D-Vision-and-Language

A curated list of research papers in 3D visual grounding. (Contact: jhj20 at mails.tsinghua.edu.cn)

💬 News

[2022/04/15]: Created this repository.
[2022/05/25]: Expanded the scope to 3D-Vision-and-Language, e.g., 3D Visual Grounding, 3D Dense Caption and 3D Question Answering.

Table of Contents

3D Visual Grounding

3D VG Paper Roadmap (Chronological Order)

ECCV 2020

  1. Achlioptas, Panos, et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. ECCV 2020, Oral. [Paper] [Code] [Website]
  2. Chen, Dave Zhenyu, et al. ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language. ECCV 2020. [Paper] [Code] [Website]

AAAI 2021

  1. Huang, Pin-Hao, et al. Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation. AAAI 2021. [Paper] [Code]

CVPR 2021

  1. Feng, Mingtao, et al. Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud. CVPR 2021. [Paper] [Code]
  2. Liu, Haolin, et al. Refer-It-in-RGBD: A Bottom-Up Approach for 3D Visual Grounding in RGBD Images. CVPR 2021. [Paper] [Code] [Website]

ICCV 2021

  1. Yang, Zhengyuan, et al. SAT: 2D Semantics Assisted Training for 3D Visual Grounding. ICCV 2021, Oral. [Paper] [Code]

    Personal Notes:

    • Uses the corresponding 2D image data (ROI features, labels, bbox coordinates, and camera pose) to assist 3D grounding.
    • Very solid experiments.
  2. Yuan, Zhihao, et al. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021. [Paper] [Code]

  3. Zhao, Lichen, et al. 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021. [Paper] [Code]

    Personal Notes:

    • The novelty of this paper comes from the coordinate-guided contextual aggregation module.

ACM-MM 2021

  1. He, Dailan, et al. TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding. ACM-MM 2021. [Paper]

CVPR 2022

  1. Huang, Shijia, et al. Multi-View Transformer for 3D Visual Grounding. CVPR 2022. [Paper] [Code]

    Personal Notes:

    • Rotates the center xyz of objects to provide view-related positional information before going through a Transformer decoder.
    • SOTA results on Nr3D and Sr3D, good results on ScanRefer.
  2. Luo, Junyu, et al. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection. CVPR 2022, Oral. [Paper] [Code]

    Personal Notes:

    • First single-stage work in 3D Visual Grounding!
    • The general idea is similar to the iterative shrinking work in 2D Visual Grounding, but the design is more elegant.
  3. Cai, Daigang, et al. 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022.
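
The personal note on the Multi-View Transformer above mentions rotating object centers to obtain view-related positions. As a rough illustration only, the sketch below rotates a list of (x, y, z) centers around the vertical axis at equally spaced angles. The function name `rotate_centers`, the choice of the z axis as the vertical axis, and the default of four views are all assumptions made for this example; this is not the authors' implementation.

```python
import math


def rotate_centers(centers, num_views=4):
    """Return one rotated copy of the object centers per view.

    Hypothetical helper: rotates each (x, y, z) center around the z axis
    by 2*pi*k/num_views for k = 0 .. num_views - 1. The z coordinate is
    left unchanged.
    """
    views = []
    for k in range(num_views):
        theta = 2.0 * math.pi * k / num_views
        c, s = math.cos(theta), math.sin(theta)
        # Standard 2D rotation applied to the x and y coordinates.
        views.append([(c * x - s * y, s * x + c * y, z) for x, y, z in centers])
    return views


# Made-up object centers, used only to exercise the function.
example_centers = [(1.0, 0.0, 0.5), (0.0, 2.0, 1.0)]
example_views = rotate_centers(example_centers, num_views=4)
```

Each entry of `example_views` holds the same objects seen from a different rotated frame, which is the kind of view-dependent positional input the note describes.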

3D VG Datasets

  1. ReferIt3D (Nr3D, Sr3D/Sr3D+): Achlioptas, Panos, et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. ECCV 2020, Oral. [Paper] [Code] [Website] [Leaderboard]

    Dataset Statistics:

    • Natural Reference in 3D (Nr3D)
    • Spatial Reference in 3D (Sr3D)
  2. ScanRefer: Chen, Dave Zhenyu, et al. ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language. ECCV 2020. [Paper] [Code] [Website] [Leaderboard]

    Dataset Statistics:

    • On average, there are 13.81 objects, 64.48 descriptions per scene, and 4.67 descriptions per object.
    • Average length of descriptions is 20.27. Frequency of object attributes: spatial language (98.7%), color (74.7%), shape terms (64.9%), and size information (14.2%).
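
As a quick sanity check on the ScanRefer averages quoted above, the per-scene description count should be roughly the number of objects per scene times the number of descriptions per object. The sketch below does that arithmetic with the three figures from the list; the variable names are made up for this example, and the small gap is expected because a product of averages need not equal the average of products.

```python
# Averages quoted in the ScanRefer dataset statistics above.
objects_per_scene = 13.81
descriptions_per_object = 4.67
reported_descriptions_per_scene = 64.48

# Descriptions per scene implied by the other two averages.
implied_descriptions_per_scene = objects_per_scene * descriptions_per_object

# Relative difference between the implied and the reported value.
relative_gap = (
    abs(implied_descriptions_per_scene - reported_descriptions_per_scene)
    / reported_descriptions_per_scene
)

print(f"implied: {implied_descriptions_per_scene:.2f}, "
      f"reported: {reported_descriptions_per_scene:.2f}, "
      f"relative gap: {relative_gap:.4%}")
```

The implied value comes out at about 64.49, which agrees with the reported 64.48 to within a fraction of a percent.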

3D VG Workshops

  1. CVPR 2021 1st Workshop on Language for 3D Scenes. [Website]

3D Question Answering

3D QA Paper Roadmap (Chronological Order)

CVPR 2022

  1. Azuma, Daichi, et al. ScanQA: 3D Question Answering for Spatial Scene Understanding. CVPR 2022. [Paper] [Code]

ICLR 2023

  1. Ma, Xiaojian and Yong, Silong, et al. SQA3D: Situated Question Answering in 3D Scenes. ICLR 2023. [Paper] [Data & Code]

3D QA Datasets

  1. ScanQA: Azuma, Daichi, et al. ScanQA: 3D Question Answering for Spatial Scene Understanding. CVPR 2022. [Paper] [Data Preparation]

  2. SQA3D: Ma, Xiaojian and Yong, Silong, et al. SQA3D: Situated Question Answering in 3D Scenes. ICLR 2023. [Paper] [Data & Code]

3D Dense Caption

Pending...