3D-VLP

This is the code related to "Context-aware Alignment and Mutual Masking for 3D-Language Pre-training" (CVPR 2023).

Context-aware Alignment and Mutual Masking for 3D-Language Pre-training

Abstract

3D visual language reasoning plays an important role in effective human-computer interaction. The current approaches for 3D visual reasoning are task-specific and lack pre-training methods to learn generic representations that can transfer across various tasks. Despite the encouraging progress in vision-language pre-training for image-text data, 3D-language pre-training is still an open issue due to limited 3D-language paired data, the highly sparse and irregular structure of point clouds, and ambiguities in spatial relations of 3D objects with viewpoint changes. In this paper, we present a generic 3D-language pre-training approach that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective comprises two main parts. 1) Context-aware spatial-semantic alignment to establish fine-grained correspondence between point clouds and texts. It reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-language masked modeling to enable cross-modality information exchange. Instead of reconstructing sparse 3D points for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our proposed 3D-language pre-training method achieves promising results once adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning, and 3D question answering.

Dataset & Setup

Data preparation

Our code is built on the ScanRefer, 3DJCG, and ScanQA codebases. Please refer to them for more detailed data preprocessing instructions.

  1. Download the ScanRefer dataset and unzip it under data/.
  2. Download the ScanQA dataset under data/qa/.
  3. Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
  4. Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (Please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, data/scannet/scans/ should contain one folder of ScanNet scene data per scene, with names like scene0000_00.

  5. Pre-process the ScanNet data. Running the following commands generates a folder named scannet_data/ under data/scannet/. Roughly 3.8GB of free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py

After this step, you can check if the processed scene data is valid by running:

python visualize.py --scene_id scene0000_00

  6. (Optional) Pre-process the multiview features from ENet:
python script/multiview_compute/compute_multiview_features.py
python script/multiview_compute/project_multiview_features.py --maxpool --gpu 1
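Before moving on to setup, it can help to confirm that the folders named in the steps above exist. The following is a minimal sketch, not part of the repository: the function name check_data_layout is hypothetical, and it only checks the directories mentioned in this README (data/qa/, data/scannet/scans/, data/scannet/scannet_data/) plus the scene0000_00 naming pattern. It does not validate file contents, and it does not check the ScanRefer or GloVe files because their file names are not given here.

```python
import re
from pathlib import Path

# Scene folders are named like scene0000_00 (pattern taken from the README).
SCENE_PATTERN = re.compile(r"^scene\d{4}_\d{2}$")


def check_data_layout(root="data"):
    """Return (missing_dirs, scene_count) for the layout described in the README.

    Hypothetical helper: it only checks that directories exist,
    not that the files inside them are complete or valid.
    """
    root = Path(root)
    expected = [
        root / "qa",                        # ScanQA dataset
        root / "scannet" / "scans",         # raw ScanNetV2 scenes
        root / "scannet" / "scannet_data",  # output of batch_load_scannet_data.py
    ]
    missing = [str(path) for path in expected if not path.is_dir()]

    scans_dir = root / "scannet" / "scans"
    scene_count = 0
    if scans_dir.is_dir():
        # Count only sub-folders whose names match the scene naming pattern.
        scene_count = sum(
            1
            for path in scans_dir.iterdir()
            if path.is_dir() and SCENE_PATTERN.match(path.name)
        )
    return missing, scene_count


if __name__ == "__main__":
    missing, scene_count = check_data_layout("data")
    print("scene folders found:", scene_count)
    for path in missing:
        print("missing directory:", path)
```

Run it from the repository root; an empty list of missing directories and a non-zero scene count suggest the layout matches the steps above.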

Setup

The code is tested on Ubuntu 20.04.1 LTS with PyTorch 1.8.0 and CUDA 11.1.

Create and activate a conda environment, for example:

conda create -n 3D-VLP python=3.6
conda activate 3D-VLP

Install PyTorch:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge

Install the required packages listed in requirements.txt:

pip install -r requirements.txt

Run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
python setup.py install
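After installation, a quick check of the environment can catch a version mismatch early. The sketch below is an illustration rather than a script shipped with this repository: the helper name describe_environment is hypothetical. It reports the Python version, the installed PyTorch version, the CUDA version PyTorch was built against, and whether a GPU is visible, which can be compared by eye with the tested configuration (PyTorch 1.8.0, CUDA 11.1). It does not check whether the PointNet++ CUDA modules compiled correctly.

```python
import importlib
import sys


def describe_environment():
    """Collect version information relevant to the tested setup.

    Hypothetical helper: returns a dict and never raises if PyTorch is absent.
    """
    info = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in the active environment.
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

If torch is reported as None, the conda environment is probably not activated or the PyTorch install step did not complete.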

Usage

Pre-training

To pre-train the model, run the following command:

sh scripts/pretrain.sh

The pre-trained models will be saved under outputs/exp_pretrain/.

Fine-tuning

Fine-tune the model on the ScanRefer dataset for 3D visual grounding and dense captioning:

sh scripts/finetune_scanrefer.sh

Fine-tune the model on ScanQA for 3D question answering:

sh scripts/finetune_scanqa.sh

Evaluation

Before evaluation, specify the <folder_name> of the fine-tuned model (the folder under outputs/ named with the timestamp + <tag_name>), then run the following commands. For 3D visual grounding:

sh scripts/eval_ground.sh

For 3D dense captioning:

sh scripts/eval_cap.sh

For 3D question answering:

sh scripts/eval_qa.sh
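To find the <folder_name> to specify before running the evaluation scripts, it may help to list the run folders under outputs/. The following is a small sketch under assumptions: the helper list_run_folders is hypothetical and not part of the repository, and it orders folders by modification time rather than by parsing the timestamp in the folder name, since the exact naming format is not given here.

```python
from pathlib import Path


def list_run_folders(outputs_dir="outputs"):
    """Return the names of sub-folders of outputs_dir, most recently modified first.

    Hypothetical helper: it does not check that a folder holds a valid checkpoint.
    """
    root = Path(outputs_dir)
    if not root.is_dir():
        return []
    runs = [path for path in root.iterdir() if path.is_dir()]
    # Sort by modification time; the newest run comes first.
    runs.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return [path.name for path in runs]


if __name__ == "__main__":
    for name in list_run_folders("outputs"):
        print(name)
```

The first name printed is the most recently modified run, which is usually, but not necessarily, the model that was fine-tuned last.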

Results

3D visual grounding

3D dense captioning

3D question answering

The visualization results of point clouds are obtained through MeshLab.

Citation

@inproceedings{jin2023context,
  title={Context-aware Alignment and Mutual Masking for 3D-Language Pre-training},
  author={Jin, Zhao and Hayat, Munawar and Yang, Yuwei and Guo, Yulan and Lei, Yinjie},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10984--10994},
  year={2023}
}

Acknowledgement

We would like to thank facebookresearch/votenet for the 3D object detection codebase and daveredrum/ScanRefer for the 3D localization codebase.

License

This repository is released under the MIT License.