RUSSO: 3D Reference Understanding Transformers with Less Boxes

Official implementation of "RUSSO: 3D Reference Understanding Transformers with Less Boxes".

Install

Requirements

We showcase the installation for CUDA 11.1 and torch==1.10.2, which is what we used for our experiments. If you need to use a different version, you can try to modify environment.yml accordingly.

Install environment: conda env create -f environment.yml --name russo
Activate environment: conda activate russo
Install torch: pip install -U torch==1.10.2 torchvision==0.11.3 --extra-index-url https://download.pytorch.org/whl/cu111
Install extra packages: bash my_shell_scripts/install.sh
Compile the CUDA layers for PointNet++, which we used in the backbone network: sh init.sh

Data preparation

Download ScanNet v2 data HERE. Let DATA_ROOT be the path to the folder that contains the downloaded annotations. Under DATA_ROOT there should be a folder scans. Under scans there should be folders with names like scene0001_01. We provide a script that downloads only the annotations relevant to our task: run python scripts/download_scannet_files.py. Note that the original ScanNet script is written for python2.
Download ReferIt3D annotations following the instructions HERE. Place all .csv files under DATA_ROOT/refer_it_3d/.
Download ScanRefer annotations following the instructions HERE. Place all files under DATA_ROOT/scanrefer/.
Prepare the dataset directory tree as described in my_shell_scripts/install2.sh.

(optional) Download PointNet++ checkpoint into DATA_ROOT.
Run python prepare_data.py --data_root DATA_ROOT, specifying your DATA_ROOT. This creates two .pkl files and only has to be run once.
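
Before running prepare_data.py, it can help to confirm that DATA_ROOT has the layout described above. The following is a minimal sketch, not part of this repository: the helper names missing_dataset_paths and count_scene_folders are hypothetical, and the check only covers the three folder names mentioned in the steps above (scans, refer_it_3d, scanrefer).

```python
from pathlib import Path

# Folder names taken from the data preparation steps above.
EXPECTED_SUBFOLDERS = ("scans", "refer_it_3d", "scanrefer")


def missing_dataset_paths(data_root):
    """Return the expected sub-folders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]


def count_scene_folders(data_root):
    """Count folders named like scene0001_01 under DATA_ROOT/scans."""
    scans = Path(data_root) / "scans"
    if not scans.is_dir():
        return 0
    return sum(
        1 for p in scans.iterdir() if p.is_dir() and p.name.startswith("scene")
    )
```

For example, missing_dataset_paths("/path/to/DATA_ROOT") returns an empty list when all three folders are present.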

Usage

bash my_shell_scripts/pretrain_cls.sh to pretrain the cls model (nr3d, sr3d)
bash my_shell_scripts/train_cls.sh to run semi-supervised training of the cls model (nr3d, sr3d)
bash my_shell_scripts/eval_cls.sh to evaluate the cls model
bash my_shell_scripts/pretrain_det.sh to pretrain the det model (scanrefer)
bash my_shell_scripts/train_det.sh to run semi-supervised training of the det model (scanrefer)
bash my_shell_scripts/eval_det.sh to evaluate the det model
Note: NR3D and ScanRefer need many more training epochs to converge. It is better to monitor the validation accuracy and decrease the learning rate accordingly.

Pre-trained checkpoints

Our checkpoints are available as follows:

ScanRefer

Labeled	Unlabeled	Model (Baidu Pan)	Extraction code
20%		https://pan.baidu.com/s/1Tq_cvj35KYtJuc8pKEZCJA	bft2
20%	80%	https://pan.baidu.com/s/1oSmAPp59CVzVuwZPeL1BMA	twkp
30%		https://pan.baidu.com/s/16wkTh5FwALmSNiRr9DnafQ	a958
30%	70%	https://pan.baidu.com/s/1DkVUhYaFvC_o36900qZCtQ	phyc
50%		https://pan.baidu.com/s/1EGqGnNSSOTCDs0OS0iwmyg	h7b4
50%	50%	https://pan.baidu.com/s/1HeJMf7m0rpMvZv16q7MRNg	rzw3
100%		https://pan.baidu.com/s/1YwNyJrAdA0IFa1lwkV_yQA	y4yi
100%	100%	https://pan.baidu.com/s/1R0hGWZmmrd6YLUJJvacnyw	gwgg

Nr3D

Labeled	Unlabeled	Model (Baidu Pan)	Extraction code
20%		https://pan.baidu.com/s/17KrmeeMFeeg6hB3PLziaBQ	ehfk
20%	80%	https://pan.baidu.com/s/1z_MB9YJq7wGQVpoIVAHAHg	my44
30%		https://pan.baidu.com/s/1Jkj8gxq-dHj60bIl7qIaOA	kdgt
30%	70%	https://pan.baidu.com/s/1FPurxPrn5AYOKEXs5t-UhA	mbbj
50%		https://pan.baidu.com/s/11rKbrj33s2AKRRfmcwO1rQ	ujvw
50%	50%	https://pan.baidu.com/s/17TIIQWD4g0Ltg4swj59t7w	e7j8
100%		https://pan.baidu.com/s/1XfkaATThKX3oHPONuHN7gQ	k83c
100%	100%	https://pan.baidu.com/s/1aYNVLAjaheZCby9dliIU3w	1gm1

Sr3D

Labeled	Unlabeled	Model (Baidu Pan)	Extraction code
20%		https://pan.baidu.com/s/1V2jxN4wmh75u5fG8YkKLrA	b7ka
20%	80%	https://pan.baidu.com/s/1wYgUucmASmdh6ExBtyoN5Q	str6
30%		https://pan.baidu.com/s/1nsJZHEyX_Jk-MyINM8Fg3g	1sx3
30%	70%	https://pan.baidu.com/s/1VEiJHzkkD5LDnN_FQEIANg	pd6v
50%		https://pan.baidu.com/s/1WoAFfLCNudeLK2UeINtSkQ	rbzc
50%	50%	https://pan.baidu.com/s/1Os3OQUEmfgBGQIwHI8Rt6A	hwba
100%		https://pan.baidu.com/s/1AP8i1vXx5BgQI-dXZUSs6w	piff
100%	100%	https://pan.baidu.com/s/1AP8i1vXx5BgQI-dXZUSs6w	piff

Note that a blank Unlabeled column means the model was trained with EB-DETR only; otherwise it was trained with EB-DETR and RUSSO.

Note that these checkpoints were saved while using DistributedDataParallel. To use them without DistributedDataParallel, take a look here.
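
A common way to load weights saved under DistributedDataParallel into a plain model is to strip the "module." prefix that the wrapper adds to every parameter name. The sketch below shows that renaming step on an ordinary dictionary; the helper name strip_ddp_prefix is hypothetical, and whether these checkpoints store the weights at the top level or under a nested key is an assumption you should check by inspecting the loaded file.

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DDP wrapper prefix removed from keys.

    Keys that do not start with the prefix are kept unchanged.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```

With PyTorch, the result can be passed to model.load_state_dict(...) after reading the file with torch.load(path, map_location="cpu").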

How does the evaluation work?

For each object query, we compute per-token confidence scores and regress bounding boxes.
Given a target span, we keep the most confident query for it. This is our model's best guess.
We compute the IoU of the corresponding box and the ground-truth box.
We check whether this IoU is greater than the thresholds (0.25, 0.5).
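
The four steps above can be sketched for a single target as follows. This is an illustrative sketch, not the repository's evaluation code: it assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), and it assumes the per-token confidences have already been reduced to one score per query for the target span. The function names box3d_iou and evaluate_target are hypothetical.

```python
def box3d_iou(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def evaluate_target(span_scores, pred_boxes, gt_box, thresholds=(0.25, 0.5)):
    """Keep the most confident query and test its box IoU against each threshold."""
    best = max(range(len(span_scores)), key=lambda i: span_scores[i])
    iou = box3d_iou(pred_boxes[best], gt_box)
    # Strictly greater than the threshold, as described in the steps above.
    return {t: iou > t for t in thresholds}
```

Averaging the per-threshold results over all targets gives an accuracy at each IoU threshold.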

Acknowledgements

Parts of this code were based on the codebase of BUTD-DETR.

Citing RUSSO

If you find RUSSO useful in your research, please consider citing:

Daniellli/russo