# Awesome-LLM-3D

A curated list of resources on Multi-modal Large Language Models in the 3D world.


## 🏠 About

This is a curated list of papers on 3D-related tasks empowered by Large Language Models (LLMs), covering 3D understanding, reasoning, generation, and embodied agents. We also include works built on other foundation models (e.g., CLIP, SAM) to give a fuller picture of the area.

This repository is actively maintained; watch it to follow the latest advances. If you find it useful, please star ⭐ this repo and cite the paper.

## 🔥 News

## Table of Contents

- [3D Understanding via LLM](#3d-understanding-via-llm)
- [3D Understanding via other Foundation Models](#3d-understanding-via-other-foundation-models)
- [3D Reasoning](#3d-reasoning)
- [3D Generation](#3d-generation)
- [3D Embodied Agent](#3d-embodied-agent)
- [3D Benchmarks](#3d-benchmarks)
- [Contributing](#contributing)
- [Citation](#citation)

## 3D Understanding via LLM

| Date | Keywords | Institute (first) | Paper | Publication | Others |
|------|----------|-------------------|-------|-------------|--------|
| 2024-09-28 | LLaVA-3D | HKU | LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness | arXiv | project |
| 2024-09-08 | MSR3D | BIGAI | Multi-modal Situated Reasoning in 3D Scenes | NeurIPS '24 | project |
| 2024-08-28 | GreenPLM | HUST | More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding | arXiv | github |
| 2024-06-07 | SpatialPIN | Oxford | SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors | NeurIPS '24 | project |
| 2024-05-02 | MiniGPT-3D | HUST | MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors | ACM MM '24 | project |
| 2024-02-27 | ShapeLLM | XJTU | ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | arXiv | project |
| 2024-01-22 | SpatialVLM | Google DeepMind | SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | CVPR '24 | project |
| 2023-12-21 | LiDAR-LLM | PKU | LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding | arXiv | project |
| 2023-12-15 | 3DAP | Shanghai AI Lab | 3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V | arXiv | project |
| 2023-12-13 | Chat-3D v2 | ZJU | Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers | arXiv | github |
| 2023-12-05 | GPT4Point | HKU | GPT4Point: A Unified Framework for Point-Language Understanding and Generation | arXiv | github |
| 2023-11-30 | LL3DA | Fudan University | LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | github |
| 2023-11-26 | ZSVG3D | CUHK (SZ) | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | arXiv | project |
| 2023-11-18 | LEO | BIGAI | An Embodied Generalist Agent in 3D World | arXiv | github |
| 2023-10-14 | JM3D-LLM | Xiamen University | JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues | ACM MM '23 | github |
| 2023-10-10 | Uni3D | BAAI | Uni3D: Exploring Unified 3D Representation at Scale | ICLR '24 | project |
| 2023-09-27 | - | KAUST | Zero-Shot 3D Shape Correspondence | SIGGRAPH Asia '23 | - |
| 2023-09-21 | LLM-Grounder | U-Mich | LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent | ICRA '24 | github |
| 2023-09-01 | Point-Bind | CUHK | Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following | arXiv | github |
| 2023-08-31 | PointLLM | CUHK | PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | github |
| 2023-08-17 | Chat-3D | ZJU | Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes | arXiv | github |
| 2023-08-08 | 3D-VisTA | BIGAI | 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | ICCV '23 | github |
| 2023-07-24 | 3D-LLM | UCLA | 3D-LLM: Injecting the 3D World into Large Language Models | NeurIPS '23 | github |
| 2023-03-29 | ViewRefer | CUHK | ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding | ICCV '23 | github |
| 2022-09-12 | - | MIT | Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding | arXiv | github |

## 3D Understanding via other Foundation Models

| Date | Keywords | Institute (first) | Paper | Publication | Others |
|------|----------|-------------------|-------|-------------|--------|
| 2024-04-07 | Any2Point | Shanghai AI Lab | Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding | ECCV '24 | github |
| 2024-03-16 | N2F2 | Oxford-VGG | N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields | arXiv | - |
| 2023-12-17 | SAI3D | PKU | SAI3D: Segment Any Instance in 3D Scenes | arXiv | project |
| 2023-12-17 | Open3DIS | VinAI | Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance | arXiv | project |
| 2023-11-06 | OVIR-3D | Rutgers University | OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data | CoRL '23 | github |
| 2023-10-29 | OpenMask3D | ETH | OpenMask3D: Open-Vocabulary 3D Instance Segmentation | NeurIPS '23 | project |
| 2023-10-05 | Open-Fusion | - | Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation | arXiv | github |
| 2023-09-22 | OV-3DDet | HKUST | CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | NeurIPS '23 | github |
| 2023-09-19 | LAMP | - | From Language to 3D Worlds: Adapting Language Model for Point Cloud Perception | OpenReview | - |
| 2023-09-15 | OpenNeRF | - | OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views | OpenReview | github |
| 2023-09-01 | OpenIns3D | Cambridge | OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation | arXiv | project |
| 2023-06-07 | Contrastive Lift | Oxford-VGG | Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion | NeurIPS '23 | github |
| 2023-06-04 | Multi-CLIP | ETH | Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | arXiv | - |
| 2023-05-23 | 3D-OVS | NTU | Weakly Supervised 3D Open-vocabulary Segmentation | NeurIPS '23 | github |
| 2023-05-21 | VL-Fields | University of Edinburgh | VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations | ICRA '23 | project |
| 2023-05-08 | CLIP-FO3D | Tsinghua University | CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP | ICCVW '23 | - |
| 2023-04-12 | 3D-VQA | ETH | CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | CVPRW '23 | github |
| 2023-04-03 | RegionPLC | HKU | RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding | arXiv | project |
| 2023-03-20 | CG3D | JHU | CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition | arXiv | github |
| 2023-03-16 | LERF | UC Berkeley | LERF: Language Embedded Radiance Fields | ICCV '23 | github |
| 2023-02-14 | ConceptFusion | MIT | ConceptFusion: Open-set Multimodal 3D Mapping | RSS '23 | project |
| 2023-01-12 | CLIP2Scene | HKU | CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP | CVPR '23 | github |
| 2022-12-01 | UniT3D | TUM | UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding | ICCV '23 | github |
| 2022-11-29 | PLA | HKU | PLA: Language-Driven Open-Vocabulary 3D Scene Understanding | CVPR '23 | github |
| 2022-11-28 | OpenScene | ETH | OpenScene: 3D Scene Understanding with Open Vocabularies | CVPR '23 | github |
| 2022-10-11 | CLIP-Fields | NYU | CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory | RSS '23 | project |
| 2022-07-23 | Semantic Abstraction | Columbia | Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models | CoRL '22 | project |
| 2022-04-26 | ScanNet200 | TUM | Language-Grounded Indoor 3D Semantic Segmentation in the Wild | ECCV '22 | project |

## 3D Reasoning

| Date | Keywords | Institute (first) | Paper | Publication | Others |
|------|----------|-------------------|-------|-------------|--------|
| 2023-05-20 | 3D-CLR | UCLA | 3D Concept Learning and Reasoning from Multi-View Images | CVPR '23 | github |
| - | Transcribe3D | TTI, Chicago | Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning | CoRL '23 | github |

## 3D Generation

| Date | Keywords | Institute | Paper | Publication | Others |
|------|----------|-----------|-------|-------------|--------|
| 2023-11-29 | ShapeGPT | Fudan University | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model | arXiv | github |
| 2023-11-27 | MeshGPT | TUM | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers | arXiv | project |
| 2023-10-19 | 3D-GPT | ANU | 3D-GPT: Procedural 3D Modeling with Large Language Models | arXiv | github |
| 2023-09-21 | LLMR | MIT | LLMR: Real-time Prompting of Interactive Worlds using Large Language Models | arXiv | github |
| 2023-09-20 | DreamLLM | MEGVII | DreamLLM: Synergistic Multimodal Comprehension and Creation | arXiv | github |
| 2023-04-01 | ChatAvatar | Deemos Tech | DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance | ACM TOG | website |

## 3D Embodied Agent

| Date | Keywords | Institute | Paper | Publication | Others |
|------|----------|-----------|-------|-------------|--------|
| 2024-01-22 | SpatialVLM | Google DeepMind | SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | CVPR '24 | project |
| 2023-11-27 | Dobb-E | NYU | On Bringing Robots Home | arXiv | github |
| 2023-11-26 | STEVE | ZJU | See and Think: Embodied Agent in Virtual Environment | arXiv | github |
| 2023-11-18 | LEO | BIGAI | An Embodied Generalist Agent in 3D World | arXiv | github |
| 2023-09-14 | UniHSI | Shanghai AI Lab | Unified Human-Scene Interaction via Prompted Chain-of-Contacts | arXiv | github |
| 2023-07-28 | RT-2 | Google DeepMind | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | arXiv | github |
| 2023-07-12 | SayPlan | QUT Centre for Robotics | SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning | CoRL '23 | github |
| 2023-07-12 | VoxPoser | Stanford | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | arXiv | github |
| 2022-12-13 | RT-1 | Google | RT-1: Robotics Transformer for Real-World Control at Scale | arXiv | github |
| 2022-12-08 | LLM-Planner | The Ohio State University | LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models | ICCV '23 | github |
| 2022-10-11 | CLIP-Fields | NYU, Meta | CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory | RSS '23 | github |
| 2022-09-20 | NLMap-SayCan | Google | Open-vocabulary Queryable Scene Representations for Real World Planning | ICRA '23 | github |

## 3D Benchmarks

| Date | Keywords | Institute | Paper | Publication | Others |
|------|----------|-----------|-------|-------------|--------|
| 2024-09-08 | MSQA / MSNN | BIGAI | Multi-modal Situated Reasoning in 3D Scenes | NeurIPS '24 | project |
| 2024-06-10 | 3D-GRAND / 3D-POPE | UMich | 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination | arXiv | project |
| 2024-01-18 | SceneVerse | BIGAI | SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | arXiv | github |
| 2023-12-26 | EmbodiedScan | Shanghai AI Lab | EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | arXiv | github |
| 2023-12-17 | M3DBench | Fudan University | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | arXiv | github |
| 2023-11-29 | - | DeepMind | Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects | arXiv | github |
| 2022-10-14 | SQA3D | BIGAI | SQA3D: Situated Question Answering in 3D Scenes | ICLR '23 | github |
| 2021-12-20 | ScanQA | RIKEN AIP | ScanQA: 3D Question Answering for Spatial Scene Understanding | CVPR '22 | github |
| 2020-12-03 | Scan2Cap | TUM | Scan2Cap: Context-aware Dense Captioning in RGB-D Scans | CVPR '21 | github |
| 2020-08-23 | ReferIt3D | Stanford | ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes | ECCV '20 | github |
| 2019-12-18 | ScanRefer | TUM | ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language | ECCV '20 | github |

## Contributing

Your contributions are always welcome!

I will keep some pull requests open if I'm not sure whether they are a good fit for 3D LLMs; you can vote for them by adding 👍. If you would like to add a paper, please follow the entry format shown below.
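As a reference, a new entry can reuse the same column layout as the tables above. The row below is a hypothetical placeholder (the paper title, institute, and links are made up for illustration), not a real entry:

```markdown
| Date | Keywords | Institute (first) | Paper | Publication | Others |
|------|----------|-------------------|-------|-------------|--------|
| 2024-01-01 | MyModel | Example University | [MyModel: A Placeholder Title](https://arxiv.org/abs/XXXX.XXXXX) | arXiv | [github](https://github.com/example/mymodel) |
```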


If you have any questions about this opinionated list, please get in touch at xianzheng@robots.ox.ac.uk or via WeChat (ID: mxz1997112).


## Citation

If you find this repository useful, please consider citing this paper:

@article{ma2024llmsstep3dworld,
      title={When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models},
      author={Xianzheng Ma and Yash Bhalgat and Brandon Smart and Shuai Chen and Xinghui Li and Jian Ding and Jindong Gu and Dave Zhenyu Chen and Songyou Peng and Jia-Wang Bian and Philip H Torr and Marc Pollefeys and Matthias Nießner and Ian D Reid and Angel X. Chang and Iro Laina and Victor Adrian Prisacariu},
      year={2024},
      journal={arXiv preprint arXiv:2405.10255},
}

## Acknowledgement

This repo is inspired by Awesome-LLM.