
Grounded 3D-LLM with Referent Tokens

This repository will release the official implementation of "Grounded 3D-LLM with Referent Tokens".

[Paper] [arXiv] [Website] [Data]

Abstract

Prior studies on 3D scene comprehension have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (LMMs) to consolidate various 3D visual tasks within a unified generative framework. The model utilizes a series of "referent tokens" to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data arbitrarily. 3D vision tasks are naturally transformed into language formats through task-specific prompts. To effectively associate the scene with text, we curate the grounded language datasets either from human-annotated sources or by bootstrapping existing object labels at the phrase level. We then employ Contrastive Language-Scene Pre-training (CLASP) to bridge the divide between 3D vision and language models, thus facilitating the use of referent tokens in subsequent language modeling tasks. Our comprehensive evaluation covers open-ended tasks like 3D visual question answering and dense captioning, as well as close-ended tasks such as object detection and language grounding.


Grounded Scene Caption Data Visualization

Please refer to the data visualization page for detailed instructions on the minimal setup for visualizing the grounded scene caption dataset.
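
For a quick look at the raw annotations without the full visualizer, a minimal Python sketch such as the one below can be used. It assumes the grounded scene caption file has already been downloaded to data/langdata/groundedscenecaption_format.json (as in the directory layout under Model Training) and makes no assumption about the exact entry schema.

import json

# Quick peek at the grounded scene caption annotations
# (path matches the directory layout shown under Model Training).
with open("data/langdata/groundedscenecaption_format.json", "r") as f:
    data = json.load(f)

print(type(data), len(data))
# Inspect one entry to see which fields are available before visualizing.
first = data[0] if isinstance(data, list) else next(iter(data.values()))
print(json.dumps(first, indent=2)[:1000])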

Model Training

Step 1: Environment setup and dataset preparation.

Grounded 3D-LLM is trained on 4 or 8 NVIDIA A100 GPUs. Please refer to the installation page for detailed installation scripts for model training.

Please download all the scene-language datasets from HuggingFace. The datasets are listed below (a programmatic download sketch follows the table):

Dataset                # for Train   # for Eval
ScanRefer              36639         9503
Scan2Cap               36639         9503
ScanQA                 26516         9402
Object-Description     28197         7912
GroundedSceneCaption   84301         --
EmbodiedPlanning       3500          --
EmbodiedDialogue       129799        --
GlobalSceneCaption     4065          --
3D-LLM                 27627         --
Alpaca                 51865         --
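
If you prefer to script the download, a sketch along the following lines should work with huggingface_hub; the repo ID below is a placeholder and must be replaced with the actual dataset repository from the [Data] link.

from huggingface_hub import snapshot_download

# Placeholder repo ID: replace with the actual dataset repository from the [Data] link.
snapshot_download(
    repo_id="<grounded-3d-llm-data-repo>",
    repo_type="dataset",
    local_dir="data/langdata",   # matches the directory layout shown below
)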

Please download the pretrained weights from HuggingFace and place them in $ROOT_PATH/pretrained/.

Please download the pretrained LLM weights (Tiny-Vicuna-1B) and store them in $ROOT_PATH/pretrained/llm_weight/Tiny-Vicuna-1B/.
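
A minimal sketch for scripting this download with huggingface_hub (assuming the current working directory is $ROOT_PATH):

from huggingface_hub import snapshot_download

# Fetch Tiny-Vicuna-1B (https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B) into the expected folder.
snapshot_download(
    repo_id="Jiayi-Pan/Tiny-Vicuna-1B",
    local_dir="pretrained/llm_weight/Tiny-Vicuna-1B",
)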

If you would like to use our pretrained model checkpoints, they can be obtained from HuggingFace. Please save them in the checkpoint directory at $ROOT_PATH/saved.

Step   Model Checkpoint
1      Mask3D-CLIP
2      Mask3D-CLASP
3      Grounded 3D-LLM

After completing the downloads, the root folder should be organized as follows:

ROOT_PATH
├── data                            # data
│   ├── langdata
│   │   ├── groundedscenecaption_format.json
│   │   ├── scanrefer_format.json
│   │   ├── scanqa_format.json
│   │   └── ...
│   ├── processed
│   └── rawscannet
│       ├── scans
│       └── scannetv2-labels.combined.tsv
├── pretrained                      # pretrained weights for model training
│   ├── bert-base-uncased           # BERT pretrained weights
│   ├── label_clip_features.pth     # CLIP text features for ScanNet-200 class names
│   └── llm_weight
│       └── Tiny-Vicuna-1B          # pretrained weights from https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B
└── saved                           # save path for model checkpoints
    ├── step1_mask3d_clip_4GPUS
    ├── step2_mask3d_lang_4GPUS
    └── step3_mask3d_lang_4GPUS
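
Before starting training, a quick sanity check such as the sketch below (run from $ROOT_PATH; the paths mirror the layout above and are not exhaustive) can confirm that the required files are in place.

from pathlib import Path

# Paths taken from the directory layout above; extend as needed.
required = [
    "data/langdata/groundedscenecaption_format.json",
    "data/rawscannet/scans",
    "data/rawscannet/scannetv2-labels.combined.tsv",
    "pretrained/bert-base-uncased",
    "pretrained/label_clip_features.pth",
    "pretrained/llm_weight/Tiny-Vicuna-1B",
]
missing = [p for p in required if not Path(p).exists()]
print("All required paths found." if not missing else f"Missing: {missing}")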

Step 2: Pre-train the Mask3D detector:

bash final_scripts/step1_pretrain_detector.sh

Step 3: With the detector trained, continue pre-training it with Contrastive Language-Scene Pre-training (CLASP):

bash final_scripts/step2_pretrain_3d-clasp.sh

Step 4: After contrastive pre-training, train the entire Grounded 3D-LLM:

bash final_scripts/step3_train_grounded3dllm.sh

The model checkpoints will be saved in saved/step3_mask3d_lang_4GPUS/last-checkpoint.pth, and the inference results will be stored in saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/.

Model Evaluation

To evaluate the results of all tasks, run the following command:

bash final_scripts/test_llm.sh ./saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/

Demo

To interact with Grounded 3D-LLM via the demo chat, first run model inference and make sure the scene features are saved in saved/step3_mask3d_lang_4GPUS/scene_features. Then launch the Gradio demo chat with the following command:

bash web_chat_demo/web_chat_demo.sh 

Please note that the visualization of the related segmentation masks is not yet supported in the Gradio demo.

ToDo List

  • Release Grounded Scene Caption data (ScanNet).
  • Release data visualizer.
  • Release data generation code.
  • Release pre-trained checkpoints.
  • Release Grounded 3D-LLM training and evaluation.
  • Demo supports mask visualization.

Acknowledgement

Many thanks to the following open-source projects: