OVO: Open-Vocabulary Occupancy
Zhiyu Tan*, Zichao Dong*, Cheng Zhang, Weikun Zhang, Hang Ji, Hao Li
This repo is the official implementation of OVO: Open-Vocabulary Occupancy.
Semantic occupancy prediction aims to infer dense geometry and semantics of the surroundings so that an autonomous agent can operate safely in the 3D environment. Existing occupancy prediction methods are almost entirely trained on human-annotated volumetric data. Although of high quality, such 3D annotations are laborious and costly to generate, which restricts trained models to the few object categories annotated in the training dataset.
We propose Open-Vocabulary Occupancy (OVO), a novel approach that enables semantic occupancy prediction for arbitrary classes without the need for 3D annotations during training. The keys to our approach are (1) knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to the 3D occupancy network, and (2) pixel-voxel filtering for high-quality training data generation. The resulting framework is simple, compact, and compatible with most state-of-the-art semantic occupancy prediction models.
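For intuition, here is a minimal sketch of how the two ideas connect at the feature level. The names, shapes, and cosine-based losses are our assumptions for exposition, not the paper's exact formulation:

```python
# Sketch: distill 2D open-vocabulary features into voxel features, then
# classify voxels by similarity to text prompt embeddings. Assumed shapes:
#   voxel_feats   (N, C): per-voxel embeddings from the occupancy network
#   pixel_feats   (N, C): LSeg embeddings of the pixels the voxels project to
#   prompt_embeds (K, C): CLIP text embeddings of K class prompts
import torch
import torch.nn.functional as F

def distill_loss(voxel_feats, pixel_feats):
    # Pull each voxel feature toward its paired open-vocabulary pixel feature
    # (pairs come from the pixel-voxel filtering step).
    v = F.normalize(voxel_feats, dim=-1)
    p = F.normalize(pixel_feats, dim=-1)
    return (1.0 - (v * p).sum(-1)).mean()      # mean cosine distance

def classify(voxel_feats, prompt_embeds):
    # Any class expressible as a text prompt can be predicted at inference:
    # assign each voxel to its most similar prompt embedding.
    sim = F.normalize(voxel_feats, dim=-1) @ F.normalize(prompt_embeds, dim=-1).T
    return sim.argmax(-1)                      # (N,) class indices
```

Because the voxel features land in the same embedding space as the text encoder, swapping the prompt set is enough to change the vocabulary; no 3D labels are involved.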
Installation
- Create the conda environment:
$ conda create -y -n ovo python=3.7
$ conda activate ovo
- Install PyTorch:
$ conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch
- Install the additional dependencies:
$ pip install -r requirements.txt
- Install tbb:
$ conda install -c bioconda tbb=2020.2
- Downgrade torchmetrics to 0.6.0:
$ pip install torchmetrics==0.6.0
- Finally, install OVO:
$ pip install -e .
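A quick sanity check after installation (a minimal sketch; it only verifies the versions pinned above):

```python
# Verify the pinned environment: PyTorch 1.7.1, torchmetrics 0.6.0, CUDA.
import torch
import torchmetrics

assert torch.__version__.startswith("1.7.1"), torch.__version__
assert torchmetrics.__version__ == "0.6.0", torchmetrics.__version__
print("CUDA available:", torch.cuda.is_available())
```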
Data Preprocessing
- Generate LSeg embeddings. Refer to LSeg.
- Generate prompt embeddings. Refer to CLIP and get_prompt_embedding.py, or directly use the prompt_embeddings offered in this repository. A sketch of this step is shown below.
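For reference, a minimal sketch of generating prompt embeddings with the official CLIP package. The ViT-B/32 backbone, prompt template, class list, and JSON layout here are illustrative assumptions; get_prompt_embedding.py is authoritative:

```python
# Sketch: encode one text prompt per class with CLIP's text encoder.
import json
import clip
import torch

classes = ["ceiling", "floor", "wall", "window", "chair",
           "bed", "sofa", "table", "tvs", "furniture", "other"]

model, _ = clip.load("ViT-B/32")
device = next(model.parameters()).device
tokens = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    embeds = model.encode_text(tokens)
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)  # unit-normalize

with open("prompt_embedding.json", "w") as f:
    json.dump({c: e.tolist() for c, e in zip(classes, embeds.cpu())}, f)
```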
- Label preprocessing.
NYUv2 ov labels (used for training):
Change seg_class_map in ovo/data/NYU/preprocess_ov.py. In this repository we offer an example of merging 'bed', 'table', and 'other' into 'other' (see the sketch after the command below).
python ovo/data/NYU/preprocess_ov.py NYU_root=/path/to/NYU_dataset/depthbin/ NYU_preprocess_root=/path/to/nyu_preprocess_ov
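For illustration, this is the kind of relabeling the step performs. The class list, its ordering, and the IDs below are our assumptions; the authoritative mapping is seg_class_map in ovo/data/NYU/preprocess_ov.py:

```python
# Hypothetical sketch of merging the novel classes 'bed' and 'table' into
# 'other'. The real mapping lives in seg_class_map in preprocess_ov.py.
import numpy as np

NYU_CLASSES = ["empty", "ceiling", "floor", "wall", "window", "chair",
               "bed", "sofa", "table", "tvs", "furniture", "other"]  # assumed order
NOVEL = {"bed", "table"}                    # withheld from training
OTHER_ID = NYU_CLASSES.index("other")       # -> 11

remap = np.arange(len(NYU_CLASSES))         # identity for base classes
for name in NOVEL:
    remap[NYU_CLASSES.index(name)] = OTHER_ID

voxel_labels = np.array([1, 6, 8, 11, 2])   # example voxel labels
print(remap[voxel_labels])                  # [ 1 11 11 11  2]
```

Under this assumed ordering, the held-out IDs are exactly the +novel_class_lbl=[6,8,11] passed at inference below.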
SemanticKITTI ov labels (used for training):
Change learning_map_inv in ovo/data/semantic_kitti/semantic-kitti.yaml. In this repository we offer an example of merging 'car', 'road', and 'building' into 'road' (see the sketch after the command below).
python ovo/data/semantic_kitti/preprocess_ov.py kitti_root=/path/to/kitti_dataset/ kitti_preprocess_root=/path/to/kitti_preprocess_ov
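The learning_map_inv table in semantic-kitti.yaml maps each training class ID back to a raw SemanticKITTI label. A minimal sketch of inspecting it (assumes PyYAML and the standard semantic-kitti.yaml keys):

```python
# Print the training-ID -> raw-label mapping from the dataset config.
import yaml

with open("ovo/data/semantic_kitti/semantic-kitti.yaml") as f:
    cfg = yaml.safe_load(f)

# The ov example in this repo edits this table so that 'car' and 'building'
# end up merged with 'road' in the training labels.
for train_id, raw_id in sorted(cfg["learning_map_inv"].items()):
    print(train_id, "->", raw_id, cfg["labels"][raw_id])
```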
NYUv2 ori labels (used for inference):
python ovo/data/NYU/preprocess_ori.py NYU_root=/path/to/NYU_dataset/depthbin/ NYU_preprocess_root=/path/to/nyu_preprocess_ori
SemanticKITTI ori labels (used for inference):
python ovo/data/semantic_kitti/preprocess_ori.py kitti_root=/path/to/kitti_dataset/ kitti_preprocess_root=/path/to/kitti_preprocess_ori
- Occlusion preprocessing (a sketch of the underlying visibility test follows the commands below).
python ovo/occlusion_preprocess/find_occ_pairs_kitti.py /path/to/kitti_preprocess_ov
python ovo/occlusion_preprocess/find_occ_pairs_nyu.py /path/to/nyu_preprocess_ov/base/NYUtrain/
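Conceptually, this step keeps only pixel-voxel pairs where the voxel is actually visible in the image. A minimal sketch of such a visibility test; the pinhole model and all names are our assumptions, not the scripts' actual code:

```python
# Depth-based visibility test for a single voxel center.
# K: 3x3 camera intrinsics; xyz_cam: voxel center in the camera frame;
# depth_map: sensor depth image in meters. All illustrative assumptions.
import numpy as np

def voxel_visible(xyz_cam, K, depth_map, tol=0.1):
    if xyz_cam[2] <= 0:                        # behind the camera
        return False
    u, v, _ = (K @ xyz_cam) / xyz_cam[2]       # pinhole projection
    u, v = int(round(u)), int(round(v))
    h, w = depth_map.shape
    if not (0 <= u < w and 0 <= v < h):        # projects outside the image
        return False
    # Occluded if the depth sensor sees a closer surface along this ray.
    return xyz_cam[2] <= depth_map[v, u] + tol
```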
- Voxel selection.
Fill in the path parameters in ovo/data/NYU/nyu_valid_pairs.py, then run:
python ovo/data/NYU/nyu_valid_pairs.py
Fill in the path parameters in ovo/data/semantic_kitti/kitti_valid_pairs.py, then run:
python ovo/data/semantic_kitti/kitti_valid_pairs.py
- Integrate all pre-processed data.
Fill in the path parameters in ovo/data/NYU/prepare_total.py, then run:
python ovo/data/NYU/prepare_total.py
Fill in the path parameters in ovo/data/semantic_kitti/prepare_total.py, then run:
python ovo/data/semantic_kitti/prepare_total.py
Training
NYUv2
# train_nyu.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python ./ovo/scripts/train_ovo.py \
dataset=NYU \
NYU_root=/path/to/NYU_dataset/depthbin/ \
NYU_preprocess_root=/path/to/nyu_preprocess_ov \
NYU_prepare_total=/path/to/nyu_preprocess_total \
logdir=./outputs \
n_gpus=8 batch_size=8
SemanticKITTI
# train_kitti.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python ./ovo/scripts/train_ovo.py \
dataset=kitti \
kitti_root=/path/to/kitti_dataset/ \
kitti_preprocess_root=/path/to/kitti_preprocess_ov/ \
kitti_prepare_total=/path/to/kitti_preprocess_total \
logdir=./outputs \
n_gpus=8 batch_size=8
Inference
NYUv2
# infer_nyu.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python ovo/scripts/infer_ovo.py \
dataset=NYU \
NYU_root=/path/to/NYU_dataset/depthbin/ \
NYU_preprocess_root=/path/to/nyu_preprocess_ori \
+word_path=ovo/prompt_embedding/nyu_prompt_embedding.json \
+model_path=/path/to/model_file/last.ckpt \
+output_path=/data/visualization_file/ \
+novel_class_lbl=[6,8,11] \
+target_lbl=11 \
n_gpus=1 batch_size=1 \
vis=True miou=True
SemanticKITTI
# infer_kitti.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python ovo/scripts/infer_ovo.py \
dataset=kitti \
kitti_root=/path/to/kitti_dataset/ \
kitti_preprocess_root=/path/to/kitti_preprocess_ori/ \
+word_path=ovo/prompt_embedding/kitti_prompt_embedding.json \
+model_path=/path/to/model_file/last.ckpt \
+output_path=/path/to/visualization_file/ \
+novel_class_lbl=[1,9,13] \
+target_lbl=9 \
n_gpus=1 batch_size=1 \
vis=True miou=True
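Here +novel_class_lbl lists the IDs of the classes merged away during preprocessing (car=1, road=9, building=13 for SemanticKITTI; bed=6, table=8, other=11 for NYUv2), +target_lbl appears to select one of them for visualization, and miou=True reports IoU. For reference, a minimal sketch of per-class voxel IoU; the ignore value and array layout are assumptions:

```python
# Per-class IoU between predicted and ground-truth voxel label volumes.
# Assumes integer labels and 255 for unknown/ignored voxels.
import numpy as np

def class_iou(pred, gt, lbl, ignore=255):
    valid = gt != ignore
    inter = np.logical_and(pred == lbl, gt == lbl)[valid].sum()
    union = np.logical_or(pred == lbl, gt == lbl)[valid].sum()
    return inter / union if union else float("nan")

# e.g. novel-class mean IoU on SemanticKITTI with novel_class_lbl=[1, 9, 13]:
# np.nanmean([class_iou(pred, gt, l) for l in (1, 9, 13)])
```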
Visualization
Refer to MonoScene for visualization details.
Fill in the path parameters in ovo/scripts/visualization/nyu_vis.py, then run:
python ovo/scripts/visualization/nyu_vis.py
Fill in the path parameters in ovo/scripts/visualization/kitti_vis.py, then run:
python ovo/scripts/visualization/kitti_vis.py
Results
In each table, the left class group (bed, table, other for NYUv2; car, road, building for SemanticKITTI) holds the novel classes merged away during training (see Data Preprocessing), and the right group holds the base classes; numbers are per-class IoU with group means.
NYUv2
Method | Input | bed | table | other | mean | ceiling | floor | wall | window | chair | sofa | tvs | furniture | mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fully-supervised | ||||||||||||||
AICNet | C, D | 35.87 | 11.11 | 6.45 | 17.81 | 7.58 | 82.97 | 9.15 | 0.05 | 6.93 | 22.92 | 0.71 | 15.90 | 18.28 |
SSCNet | C, D | 32.10 | 13.00 | 10.10 | 18.40 | 15.10 | 94.70 | 24.40 | 0.00 | 12.60 | 35.0 | 7.80 | 27.10 | 27.10 |
3DSketch | C | 42.29 | 13.88 | 8.19 | 21.45 | 8.53 | 90.45 | 9.94 | 5.67 | 10.64 | 29.21 | 9.38 | 23.83 | 23.46 |
MonoScene | C | 48.19 | 15.13 | 12.94 | 25.42 | 8.89 | 93.50 | 12.06 | 12.57 | 13.72 | 36.11 | 15.22 | 27.96 | 27.50 |
Zero-shot | ||||||||||||||
MonoScene* | C | -- | -- | -- | -- | 8.10 | 93.49 | 9.94 | 10.32 | 13.24 | 34.47 | 11.75 | 26.41 | 25.96 |
Ours | C | 41.61 | 10.45 | 8.39 | 20.15 | 7.77 | 93.16 | 7.77 | 6.95 | 10.01 | 33.83 | 8.22 | 25.64 | 24.17 |
SemanticKITTI
Method | Input | car | road | building | mean | sidewalk | parking | other ground | truck | bicycle | motorcycle | other vehicle | vegetation | trunk | terrain | person | bicyclist | motorcyclist | fence | pole | traffic sign | mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fully-supervised | ||||||||||||||||||||||
AICNet | C, D | 15.3 | 39.3 | 9.6 | 21.4 | 18.3 | 19.8 | 1.6 | 0.7 | 0.0 | 0.0 | 0.0 | 9.6 | 1.9 | 13.5 | 0.0 | 0.0 | 0.0 | 5.0 | 0.1 | 0.0 | 4.4 |
3DSketch | C | 17.1 | 37.7 | 12.1 | 22.3 | 19.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.1 | 0.0 | 16.1 | 0.0 | 0.0 | 0.0 | 3.4 | 0.0 | 0.0 | 3.2 |
MonoScene | C | 18.8 | 54.7 | 14.4 | 29.3 | 27.1 | 24.8 | 5.7 | 3.3 | 0.5 | 0.7 | 4.4 | 14.9 | 2.4 | 19.5 | 1.0 | 1.4 | 0.4 | 11.1 | 3.3 | 2.1 | 7.7 |
TPVFormer | C×6 | 23.8 | 56.5 | 13.9 | 31.4 | 25.9 | 20.6 | 0.9 | 8.1 | 0.4 | 0.1 | 4.4 | 16.9 | 2.3 | 30.4 | 0.5 | 0.9 | 0.0 | 5.9 | 3.1 | 1.5 | 7.6 |
Zero-shot | ||||||||||||||||||||||
Ours | C | 13.3 | 53.9 | 9.7 | 25.7 | 26.5 | 14.4 | 0.1 | 0.7 | 0.4 | 0.3 | 2.5 | 17.2 | 2.3 | 29.0 | 0.6 | 0.7 | 0.0 | 5.4 | 3.0 | 1.7 | 6.6 |
Acknowledgement
Our code is based on MonoScene. Many thanks to the authors for their great work.
Citation
If you find this project helpful, please consider citing the following paper:
@misc{tan2023ovo,
title={OVO: Open-Vocabulary Occupancy},
author={Zhiyu Tan and Zichao Dong and Cheng Zhang and Weikun Zhang and Hang Ji and Hao Li},
year={2023},
eprint={2305.16133},
archivePrefix={arXiv},
primaryClass={cs.CV}
}