Awesome-LLM-3D

🏠 About

Here is a curated list of papers about 3D-Related Tasks empowered by Large Language Models (LLMs). It contains various tasks including 3D understanding, reasoning, generation, and embodied agents. Also, we include other Foundation Models (CLIP, SAM) for the whole picture of this area.

This is an active repository, you can watch for following the latest advances. If you find it useful, please kindly star ⭐ this repo and cite the paper.

🔥 News

[2024-05-16] 📢 Check out the first survey paper in the 3D-LLM domain: When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
[2024-01-06] Runsen Xu added chronological information and Xianzheng Ma reorganized it in Z-A order for better following the latest advances.
[2023-12-16] Xianzheng Ma and Yash Bhalgat curated this list and published the first version;

Table of Content

Awesome-LLM-3D

3D Understanding via LLM

Date	Keywords	Institute (first)	Paper	Publication	Others
2024-09-28	LLaVA-3D	HKU	LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness	Arxiv	project
2024-09-08	MSR3D	BIGAI	Multi-modal Situated Reasoning in 3D Scenes	NeurIPS '24	project
2024-08-28	GreenPLM	HUST	More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding	Arxiv	github
2024-06-07	SpatialPIN	Oxford	SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors	NeurIPS '24	project
2024-05-02	MiniGPT-3D	HUST	MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors	ACM MM '24	project
2024-02-27	ShapeLLM	XJTU	ShapeLLM: Universal 3D Object Understanding for Embodied Interaction	Arxiv	project
2024-01-22	SpatialVLM	Google DeepMind	SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities	CVPR '24	project
2023-12-21	LiDAR-LLM	PKU	LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding	Arxiv	project
2023-12-15	3DAP	Shanghai AI Lab	3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V	Arxiv	project
2023-12-13	Chat-3D v2	ZJU	Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers	Arxiv	github
2023-12-5	GPT4Point	HKU	GPT4Point: A Unified Framework for Point-Language Understanding and Generation	Arxiv	github
2023-11-30	LL3DA	Fudan University	LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning	Arxiv	github
2023-11-26	ZSVG3D	CUHK(SZ)	Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding	Arxiv	project
2023-11-18	LEO	BIGAI	An Embodied Generalist Agent in 3D World	Arxiv	github
2023-10-14	JM3D-LLM	Xiamen University	JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues	ACM MM '23	github
2023-10-10	Uni3D	BAAI	Uni3D: Exploring Unified 3D Representation at Scale	ICLR '24	project
2023-9-27	-	KAUST	Zero-Shot 3D Shape Correspondence	Siggraph Asia '23	-
2023-9-21	LLM-Grounder	U-Mich	LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent	ICRA '24	github
2023-9-1	Point-Bind	CUHK	Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following	Arxiv	github
2023-8-31	PointLLM	CUHK	PointLLM: Empowering Large Language Models to UnderstandPoint Clouds	Arxiv	github
2023-8-17	Chat-3D	ZJU	Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes	Arxiv	github
2023-8-8	3D-VisTA	BIGAI	3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment	ICCV '23	github
2023-7-24	3D-LLM	UCLA	3D-LLM: Injecting the 3D World into Large Language Models	NeurIPS '23	github
2023-3-29	ViewRefer	CUHK	ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding	ICCV '23	github
2022-9-12	-	MIT	Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding	Arxiv	github

3D Understanding via other Foundation Models

ID	keywords	Institute (first)	Paper	Publication	Others
2024-04-07	Any2Point	Shanghai AI Lab	Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding	ECCV 2024	github
2024-03-16	N2F2	Oxford-VGG	N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields	Arxiv	-
2023-12-17	SAI3D	PKU	SAI3D: Segment Any Instance in 3D Scenes	Arxiv	project
2023-12-17	Open3DIS	VinAI	Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance	Arxiv	project
2023-11-6	OVIR-3D	Rutgers University	OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data	CoRL '23	github
2023-10-29	OpenMask3D	ETH	OpenMask3D: Open-Vocabulary 3D Instance Segmentation	NeurIPS '23	project
2023-10-5	Open-Fusion	-	Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation	Arxiv	github
2023-9-22	OV-3DDet	HKUST	CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection	NeurIPS '23	github
2023-9-19	LAMP	-	From Language to 3D Worlds: Adapting Language Model for Point Cloud Perception	OpenReview	-
2023-9-15	OpenNerf	-	OpenNerf: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views	OpenReview	github
2023-9-1	OpenIns3D	Cambridge	OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation	Arxiv	project
2023-6-7	Contrastive Lift	Oxford-VGG	Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion	NeurIPS '23	github
2023-6-4	Multi-CLIP	ETH	Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes	Arxiv	-
2023-5-23	3D-OVS	NTU	Weakly Supervised 3D Open-vocabulary Segmentation	NeurIPS '23	github
2023-5-21	VL-Fields	University of Edinburgh	VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations	ICRA '23	project
2023-5-8	CLIP-FO3D	Tsinghua University	CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP	ICCVW '23	-
2023-4-12	3D-VQA	ETH	CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes	CVPRW '23	github
2023-4-3	RegionPLC	HKU	RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding	Arxiv	project
2023-3-20	CG3D	JHU	CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition	Arxiv	github
2023-3-16	LERF	UC Berkeley	LERF: Language Embedded Radiance Fields	ICCV '23	github
2023-2-14	ConceptFusion	MIT	ConceptFusion: Open-set Multimodal 3D Mapping	RSS '23	project
2023-1-12	CLIP2Scene	HKU	CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP	CVPR '23	github
2022-12-1	UniT3D	TUM	UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding	ICCV '23	github
2022-11-29	PLA	HKU	PLA: Language-Driven Open-Vocabulary 3D Scene Understanding	CVPR '23	github
2022-11-28	OpenScene	ETHz	OpenScene: 3D Scene Understanding with Open Vocabularies	CVPR '23	github
2022-10-11	CLIP-Fields	NYU	CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory	Arxiv	project
2022-7-23	Semantic Abstraction	Columbia	Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models	CoRL '22	project
2022-4-26	ScanNet200	TUM	Language-Grounded Indoor 3D Semantic Segmentation in the Wild	ECCV '22	project

3D Reasoning

Date	keywords	Institute (first)	Paper	Publication	Others
2023-5-20	3D-CLR	UCLA	3D Concept Learning and Reasoning from Multi-View Images	CVPR '23	github
-	Transcribe3D	TTI, Chicago	Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning	CoRL '23	github

3D Generation

Date	keywords	Institute	Paper	Publication	Others
2023-11-29	ShapeGPT	Fudan University	ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model	Arxiv	github
2023-11-27	MeshGPT	TUM	MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers	Arxiv	project
2023-10-19	3D-GPT	ANU	3D-GPT: Procedural 3D Modeling with Large Language Models	Arxiv	github
2023-9-21	LLMR	MIT	LLMR: Real-time Prompting of Interactive Worlds using Large Language Models	Arxiv	github
2023-9-20	DreamLLM	MEGVII	DreamLLM: Synergistic Multimodal Comprehension and Creation	Arxiv	github
2023-4-1	ChatAvatar	Deemos Tech	DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance	ACM TOG	website

3D Embodied Agent

Date	keywords	Institute	Paper	Publication	Others
2024-01-22	SpatialVLM	Deepmind	SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities	CVPR '24	project
2023-11-27	Dobb-E	NYU	On Bringing Robots Home	Arxiv	github
2023-11-26	STEVE	ZJU	See and Think: Embodied Agent in Virtual Environment	Arxiv	github
2023-11-18	LEO	BIGAI	An Embodied Generalist Agent in 3D World	Arxiv	github
2023-9-14	UniHSI	Shanghai AI Lab	Unified Human-Scene Interaction via Prompted Chain-of-Contacts	Arxiv	github
2023-7-28	RT-2	Google-DeepMind	RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	Arxiv	github
2023-7-12	SayPlan	QUT Centre for Robotics	SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning	CoRL '23	github
2023-7-12	VoxPoser	Stanford	VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models	Arxiv	github
2022-12-13	RT-1	Google	RT-1: Robotics Transformer for Real-World Control at Scale	Arxiv	github
2022-12-8	LLM-Planner	The Ohio State University	LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models	ICCV '23	github
2022-10-11	CLIP-Fields	NYU, Meta	CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory	RSS '23	github
2022-09-20	NLMap-SayCan	Google	Open-vocabulary Queryable Scene Representations for Real World Planning	ICRA '23	github

3D Benchmarks

Date	keywords	Institute	Paper	Publication	Others
2024-09-08	MSQA / MSNN	BIGAI	Multi-modal Situated Reasoning in 3D Scenes	NeurIPS '24	project
2024-06-10	3D-GRAND / 3D-POPE	UMich	3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination	Arxiv	project
2024-1-18	SceneVerse	BIGAI	SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding	Arxiv	github
2023-12-26	EmbodiedScan	Shanghai AI Lab	EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI	Arxiv	github
2023-12-17	M3DBench	Fudan University	M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts	Arxiv	github
2023-11-29	-	DeepMind	Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects	Arxiv	github
2022-10-14	SQA3D	BIGAI	SQA3D: Situated Question Answering in 3D Scenes	ICLR '23	github
2021-12-20	ScanQA	RIKEN AIP	ScanQA: 3D Question Answering for Spatial Scene Understanding	CVPR '23	github
2020-12-3	Scan2Cap	TUM	Scan2Cap: Context-aware Dense Captioning in RGB-D Scans	CVPR '21	github
2020-8-23	ReferIt3D	Stanford	ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes	ECCV '20	github
2019-12-18	ScanRefer	TUM	ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language	ECCV '20	github

Contributing

Your contributions are always welcome!

I will keep some pull requests open if I'm not sure if they are awesome for 3D LLMs, you could vote for them by adding 👍 to them.

If you have any questions about this opinionated list, please get in touch at xianzheng@robots.ox.ac.uk or Wechat ID: mxz1997112.

Star History

Citation

If you find this repository useful, please consider citing this paper:

@misc{ma2024llmsstep3dworld,
      title={When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models}, 
      author={Xianzheng Ma and Yash Bhalgat and Brandon Smart and Shuai Chen and Xinghui Li and Jian Ding and Jindong Gu and Dave Zhenyu Chen and Songyou Peng and Jia-Wang Bian and Philip H Torr and Marc Pollefeys and Matthias Nießner and Ian D Reid and Angel X. Chang and Iro Laina and Victor Adrian Prisacariu},
      year={2024},
      journal={arXiv preprint arXiv:2405.10255},
}

Acknowledgement

This repo is inspired by Awesome-LLM

tecworks-dev/Awesome-LLM-3D