Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
Preliminary Code.
Installation:
If you extract the masks with Mask2Former, please follow Mask2Former to install the environment and download the pretrained weights to the current directory.
If you extract the masks with SAM, please follow Segment Anything to install the environment and download the pretrained weights to the current directory.
Extract masks with Mask2Former:
$ cd ./three_steps_3d_feature/first_step
$ python maskformer_mask.py --scene_dir_path DATA_DIR_WITH_RGB_IMAGES --save_dir_path DIR_YOU_WANT_TO_SAVE_THE_MASKS
Extract masks with Segment Anything:
$ cd ./three_steps_3d_feature/first_step
$ python sam_mask.py --scene_dir_path DATA_DIR_WITH_RGB_IMAGES --save_dir_path DIR_YOU_WANT_TO_SAVE_THE_MASKS
After the first step, you should have a directory of masks (specified by --save_dir_path) that contains the extracted masks for the multi-view images of the scenes.
Installation: Follow the 3D-LLM_BLIP2-based section below to install salesforce-lavis.
There are four options: (1) extract CLIP features with Mask2Former masks; (2) extract CLIP features with SAM masks; (3) extract BLIP features with Mask2Former masks; (4) extract BLIP features with SAM masks.
Extract 2D CLIP features with Mask2Former masks:
$ cd ./three_steps_3d_feature/second_step/
$ python clip_maskformer.py --scene_dir_path DATA_DIR_WITH_RGB_IMAGES --mask_dir_path MASK_DIR_FROM_1ST_STEP --save_dir_path DIR_YOU_WANT_TO_SAVE_THE_FEAT
The scripts for the other three options follow a similar format.
After the second step, you should have a directory of features (specified by --save_dir_path) that contains the 2D features for the multi-view images of the scenes.
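The first two steps can be chained from a small driver script. The sketch below only assembles the commands shown above for the Mask2Former + CLIP option; the `masks` and `features` subdirectory names, the `plan` helper, and the example paths are hypothetical choices made for illustration, not part of the repository. Paths are resolved to absolute ones because each script is run from its own directory.

```python
import shlex
from pathlib import Path

# Directory that holds first_step/ and second_step/, as used in the commands above.
REPO_ROOT = Path("three_steps_3d_feature")


def first_step_cmd(scene_dir, mask_dir):
    """Mask extraction with Mask2Former (flags copied from the first step)."""
    return ["python", "maskformer_mask.py",
            "--scene_dir_path", str(scene_dir),
            "--save_dir_path", str(mask_dir)]


def second_step_cmd(scene_dir, mask_dir, feat_dir):
    """2D CLIP feature extraction with Mask2Former masks (second step)."""
    return ["python", "clip_maskformer.py",
            "--scene_dir_path", str(scene_dir),
            "--mask_dir_path", str(mask_dir),
            "--save_dir_path", str(feat_dir)]


def plan(scene_dir, work_dir):
    """Return (working directory, command) pairs for steps 1 and 2.

    The "masks" and "features" subdirectories are hypothetical names.
    """
    scene_dir = Path(scene_dir).resolve()
    work_dir = Path(work_dir).resolve()
    mask_dir = work_dir / "masks"
    feat_dir = work_dir / "features"
    return [
        (REPO_ROOT / "first_step", first_step_cmd(scene_dir, mask_dir)),
        (REPO_ROOT / "second_step", second_step_cmd(scene_dir, mask_dir, feat_dir)),
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cwd, cmd in plan("data/scene0", "out/scene0"):
        print(f"(cd {cwd} && {shlex.join(cmd)})")
```

To execute the steps rather than print them, each pair can be passed to `subprocess.run(cmd, cwd=cwd, check=True)` from the repository root, which stops at the first failing step.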
Installation:
Please install the Habitat environment.
Reconstruct 3D feature from multi-view 2D features:
$ cd ./three_steps_3d_feature/third_step/
$ python sam_mask.py --data_dir_path DATA_DIR_WITH_RGB_IMAGES --depth_dir_path DATA_DIR_WITH_DEPTH_IMAGES --feat_dir_path FEATURE_DIR_FROM_2ND_STEP
After the third step, you should have two files (pcd_pos.pt and pcd_feat.pt) for each room inside the corresponding RGB directory.
pcd_pos.pt contains the point positions of the 3D point cloud (shape: N * 3). pcd_feat.pt contains the point features of the 3D point cloud (shape: N * n_dim).
N is the number of sampled points in the point cloud (default: 300000) and n_dim is the feature dimension (1024 for CLIP features, 1408 for BLIP features).
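A quick sanity check on the third-step output can catch mismatched files early. The sketch below assumes that both files were written with `torch.save` and hold tensor-like objects with a `.shape` attribute, which the text above does not state explicitly; `check_point_cloud` and `load_room` are hypothetical helper names.

```python
from pathlib import Path

# Feature dimensions stated above: 1024 for CLIP features, 1408 for BLIP features.
EXPECTED_FEATURE_DIMS = {"clip": 1024, "blip": 1408}


def check_point_cloud(pos_shape, feat_shape, feature_type):
    """Validate the shapes of pcd_pos.pt and pcd_feat.pt; return N."""
    pos_shape, feat_shape = tuple(pos_shape), tuple(feat_shape)
    n_dim = EXPECTED_FEATURE_DIMS[feature_type]
    if len(pos_shape) != 2 or pos_shape[1] != 3:
        raise ValueError(f"positions should be N * 3, got {pos_shape}")
    if len(feat_shape) != 2 or feat_shape[1] != n_dim:
        raise ValueError(f"features should be N * {n_dim}, got {feat_shape}")
    if pos_shape[0] != feat_shape[0]:
        raise ValueError("positions and features disagree on N")
    return pos_shape[0]


def load_room(room_dir, feature_type="clip"):
    """Load one room's point cloud (assumes the files were saved with torch.save)."""
    import torch  # imported lazily so the shape check works without PyTorch

    room_dir = Path(room_dir)
    pos = torch.load(room_dir / "pcd_pos.pt", map_location="cpu")
    feat = torch.load(room_dir / "pcd_feat.pt", map_location="cpu")
    check_point_cloud(pos.shape, feat.shape, feature_type)
    return pos, feat
```

For example, `load_room("DATA_DIR_WITH_RGB_IMAGES/some_room", "blip")` would raise a `ValueError` if the features were actually extracted with CLIP; the room path here is a placeholder.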
Follow the instructions in 3DLanguage_data/ChatCaptioner_based/objaverse_render/README.md for installation.
The following code renders images of an Objaverse scene (e.g., f6e9ec5953854dff94176c36b877c519). The rendered images are saved at 3DLanguage_data/ChatCaptioner_based/objaverse_render/output.
(Please refer to 3DLanguage_data/ChatCaptioner_based/objaverse_render/README.md for more details about the command.)
$ cd ./3DLanguage_data/ChatCaptioner_based/objaverse_render
$ {path/to/blender} -b -P render.py -noaudio --disable-crash-handler -- --uid f6e9ec5953854dff94176c36b877c519
Installation:
Please follow ChatCaptioner to install the environment.
The following code reads the rendered images of an Objaverse scene (e.g., f6e9ec5953854dff94176c36b877c519) and generates a scene caption at 3DLanguage_data/ChatCaptioner_based/output.
$ cd ./3DLanguage_data/ChatCaptioner_based
$ python chatcaption.py --specific_scene f6e9ec5953854dff94176c36b877c519
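When several scenes need captions, the render and caption commands above can be generated per UID. This is a minimal sketch that reuses only the flags shown above; the `commands_for` helper and the `blender` executable name are assumptions, and the Blender path should be replaced with the local installation.

```python
import shlex
from pathlib import Path

CAPTION_DIR = Path("3DLanguage_data/ChatCaptioner_based")
RENDER_DIR = CAPTION_DIR / "objaverse_render"


def commands_for(uid, blender_path="blender"):
    """Return (working directory, command) pairs: render first, then caption."""
    render = [blender_path, "-b", "-P", "render.py", "-noaudio",
              "--disable-crash-handler", "--", "--uid", uid]
    caption = ["python", "chatcaption.py", "--specific_scene", uid]
    return [(RENDER_DIR, render), (CAPTION_DIR, caption)]


if __name__ == "__main__":
    # Dry run over a list of scene UIDs; only the example UID from above is used here.
    for uid in ["f6e9ec5953854dff94176c36b877c519"]:
        for cwd, cmd in commands_for(uid):
            print(f"(cd {cwd} && {shlex.join(cmd)})")
```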
TODO
TODO
Install salesforce-lavis
$ conda create -n lavis python=3.8
$ conda activate lavis
$ git clone https://github.com/salesforce/LAVIS.git SalesForce-LAVIS
$ cd SalesForce-LAVIS
$ pip install -e .
$ pip install positional_encodings
$ cd 3DLLM_BLIP2-base
$ conda activate lavis
# use facebook/opt-2.7b:
$ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_ft.yaml
# use flant5
$ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_flant5_ft.yaml
TODO.
If you find our work useful, please consider citing:
@article{3dllm,
author = {Hong, Yining and Zhen, Haoyu and Chen, Peihao and Zheng, Shuhong and Du, Yilun and Chen, Zhenfang and Gan, Chuang},
title = {3D-LLM: Injecting the 3D World into Large Language Models},
journal = {arXiv},
year = {2023},
}
https://github.com/salesforce/LAVIS
https://github.com/facebookresearch/Mask2Former
https://github.com/facebookresearch/segment-anything
https://github.com/mlfoundations/open_flamingo