This repository contains the code for the paper "Context-aware Alignment and Mutual Masking for 3D-Language Pre-training" (CVPR 2023).
3D visual language reasoning plays an important role in effective human-computer interaction. Current approaches for 3D visual reasoning are task-specific and lack pre-training methods to learn generic representations that can transfer across various tasks. Despite the encouraging progress in vision-language pre-training for image-text data, 3D-language pre-training is still an open issue due to limited 3D-language paired data, the highly sparse and irregular structure of point clouds, and ambiguities in spatial relations of 3D objects with viewpoint changes. In this paper, we present a generic 3D-language pre-training approach that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective consists of two main parts. 1) Context-aware spatial-semantic alignment to establish fine-grained correspondence between point clouds and texts. It reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-language masked modeling to enable cross-modality information exchange. Instead of reconstructing sparse 3D points for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our proposed 3D-language pre-training method achieves promising results once adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning and 3D question answering.
Our code is built on the ScanRefer, 3DJCG and ScanQA codebases. Please refer to them for more detailed data preprocessing instructions.
- Download the ScanRefer dataset and unzip it under data/.
- Download the ScanQA dataset under data/qa/.
- Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
- Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (please follow the ScanNet instructions for downloading the ScanNet dataset). After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.
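Before moving on, it can help to confirm that the folders from the steps above are in place. The following is a minimal sketch, not part of the repository: `check_data_layout` and `EXPECTED_DIRS` are hypothetical names, and the check only looks at the directories named above, not at the dataset files inside them.

```python
from pathlib import Path

# Directories named in the download steps above.
# The checker itself is a hypothetical helper, not a script shipped with this repo.
EXPECTED_DIRS = ["data", "data/qa", "data/scannet/scans"]


def check_data_layout(root="."):
    """Return a list of problems; an empty list means the layout looks as described."""
    root = Path(root)
    problems = []
    for rel in EXPECTED_DIRS:
        if not (root / rel).is_dir():
            problems.append(f"missing directory: {rel}")
    scans = root / "data" / "scannet" / "scans"
    if scans.is_dir():
        # Scene folders are expected to have names like scene0000_00.
        scenes = [p for p in scans.iterdir() if p.is_dir() and p.name.startswith("scene")]
        if not scenes:
            problems.append("no scene folders found under data/scannet/scans")
    return problems


if __name__ == "__main__":
    for problem in check_data_layout():
        print(problem)
```

Run from the repository root, this prints one line per missing item and nothing if the expected folders exist.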
- Pre-process ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py
After this step, you can check if the processed scene data is valid by running:
python visualize.py --scene_id scene0000_00
- (Optional) Pre-process the multiview features from ENet.
  - Download: Download the ENet multiview features (~36GB, hdf5 database) and put them under data/scannet/scannet_data/
  - Projection:
    a. Download the ENet pretrained weights (1.4MB) and put them under data/
    b. Download and decompress the extracted ScanNet frames (~13GB).
    c. Change the data paths in lib/config.py marked with TODO accordingly.
    d. Project ENet features from ScanNet frames to point clouds (~36GB, hdf5 database).
python script/multiview_compute/compute_multiview_features.py
python script/multiview_compute/project_multiview_features.py --maxpool --gpu 1
The code is tested on Ubuntu 20.04.1 LTS with PyTorch 1.8.0 and CUDA 11.1.
Create and activate a conda environment, for example:
conda create -n 3D-VLP python=3.6
conda activate 3D-VLP
Install PyTorch:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
Install the required packages listed in requirements.txt:
pip install -r requirements.txt
Run the following commands to compile the CUDA modules for the PointNet++ backbone:
cd lib/pointnet2
python setup.py install
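After installation, a quick check can confirm that the environment matches the tested versions listed above. This is a minimal sketch under assumptions: `matches_tested` and `report_environment` are hypothetical helpers, not part of this repository, and other versions may work even if they do not match.

```python
import importlib
import sys

# Versions this README reports testing with; other versions may also work.
TESTED_TORCH = "1.8.0"
TESTED_CUDA = "11.1"


def matches_tested(found, tested):
    """True if a version string such as '1.8.0+cu111' equals the tested release."""
    return found is not None and found.split("+")[0] == tested


def report_environment():
    """Collect version info without failing if PyTorch is not installed."""
    info = {"python": "%d.%d" % sys.version_info[:2], "torch": None, "cuda": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["gpu_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    env = report_environment()
    print(env)
    print("torch matches tested version:", matches_tested(env["torch"], TESTED_TORCH))
    print("cuda matches tested version:", matches_tested(env["cuda"], TESTED_CUDA))
```

The check only reads version strings; it does not verify that the compiled PointNet++ modules import correctly.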
To pre-train the model, run the following command:
sh scripts/pretrain.sh
The pre-trained models will be saved under outputs/exp_pretrain/.
Fine-tune the model on the ScanRefer dataset for 3D visual grounding and dense captioning:
sh scripts/finetune_scanrefer.sh
Fine-tune the model on ScanQA for 3D question answering:
sh scripts/finetune_scanqa.sh
Before evaluation, please specify the <folder_name> (outputs/ with the timestamp + <tag_name>) of the fine-tuned model and then run the following commands. For 3D visual grounding:
sh scripts/eval_ground.sh
For 3D dense captioning:
sh scripts/eval_cap.sh
For 3D question answering:
sh scripts/eval_qa.sh
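To look up the <folder_name> of the latest fine-tuning run, a small helper along the following lines may be convenient. It is a sketch, not part of the repository: `latest_run` is a hypothetical name, and it assumes only that each run is a directory under outputs/ whose name contains the tag. The exact timestamp format is not assumed, so the most recent run is chosen by modification time.

```python
from pathlib import Path


def latest_run(outputs_dir="outputs", tag=""):
    """Return the name of the most recently modified run folder containing `tag`, or None."""
    root = Path(outputs_dir)
    if not root.is_dir():
        return None
    # Assumption: run folders live directly under outputs/ and include the tag in their name.
    runs = [p for p in root.iterdir() if p.is_dir() and tag in p.name]
    if not runs:
        return None
    return max(runs, key=lambda p: p.stat().st_mtime).name


if __name__ == "__main__":
    # Replace the empty tag with the <tag_name> used for fine-tuning.
    print(latest_run("outputs", tag=""))
```

The returned name can then be used as <folder_name> when running the evaluation scripts.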
The visualization results of point clouds are obtained through MeshLab.
@inproceedings{jin2023context,
title={Context-aware Alignment and Mutual Masking for 3D-Language Pre-training},
author={Jin, Zhao and Hayat, Munawar and Yang, Yuwei and Guo, Yulan and Lei, Yinjie},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10984--10994},
year={2023}
}
We would like to thank facebookresearch/votenet for the 3D object detection and daveredrum/ScanRefer for the 3D localization codebase.
This repository is released under the MIT License.