UnScene3D is a fully unsupervised 3D instance segmentation method: it generates pseudo masks from self-supervised color and geometry features and refines them via self-training. Ultimately, we achieve a 300% improvement over existing unsupervised methods, even in complex and cluttered 3D scenes, and provide a powerful pretraining method.
For any code-related or other questions, open an issue here or contact David Rozenberszki. If you found this work helpful for your research, please consider citing our paper:
@inproceedings{rozenberszki2024unscene3d,
title={UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes},
author={Rozenberszki, David and Litany, Or and Dai, Angela},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
- Installation - setting up a conda environment and building/installing custom cpp tools
- Data Preprocessing - we primarily use the ScanNet dataset, which we preprocess to obtain aligned point clouds and 2D images
- Pseudo Mask Generation - we generate pseudo masks using self-supervised features and extract them for self-training
- Self-Training - we mostly follow the training procedure of Mask3D, but train with the pseudo masks, noise-robust losses, self-training iterations, and a class-agnostic evaluation
- Available Resources - we provide the pseudo datasets, pretrained models for evaluation and inference, and loads of visualized scenes.
- Pseudo Mask Generation
- Self-Training
- Evaluation
- Upload pretrained models, datasets, visualizations and training resources
- Added docker container for easy setup
The codebase was developed and tested on Ubuntu 20.04, with various GPU versions [RTX_2080, RTX_3060, RTX_3090, RTX_A6000] and NVCC 11.6. We provide an Anaconda environment with the dependencies; to install it, run
conda env create -f conf/unscene3d_env.yml
conda activate unscene3d
Additionally, MinkowskiEngine and Detectron2 have to be installed manually with a specified CUDA version. E.g., for CUDA 11.6 run
export CUDA_HOME=/usr/local/cuda-11.6
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
Finally, for building the custom cpp/cuda tools, run
cd utils/cpp_utils && python setup.py install
cd ../cuda_utils && python setup.py install
cd ../../third_party/pointnet2 && python setup.py install
cd ../..
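As an optional sanity check (not part of the original setup instructions), you can verify that the manually installed dependencies import correctly and that PyTorch sees your GPU:

```python
# Optional sanity check: confirm the manually built dependencies are importable
# and that PyTorch detects a CUDA device.
import torch
import MinkowskiEngine as ME
import detectron2

print("MinkowskiEngine:", ME.__version__)
print("Detectron2:", detectron2.__version__)
print("CUDA available:", torch.cuda.is_available())
```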
Additionally, we also provide a dockerized container for easier requirement management. For this we recommend downloading the necessary datasets and symlinking them to the data folder in this repo.
We expect Docker to be installed on your system (example here), and the --volume mappings in the .devcontainer/start.sh file to be changed accordingly. Here we expect UnScene3D/data/ScanNet to be the raw ScanNet dataset, while /UnScene3D/data should be the directory where the Mask3D-processed files live. Example preprocessed datasets can be downloaded from here.
Finally, you can initialize the container with
. .devcontainer/start.sh
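For reference, the volume mappings could look roughly like the following, where the host-side paths are placeholders for wherever you downloaded the data (adjust them to match your own start.sh):

--volume /path/to/raw/ScanNet:/UnScene3D/data/ScanNet \
--volume /path/to/processed/data:/UnScene3D/data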
After installing the dependencies, we preprocess the datasets.
We provide example training and inference scripts for the ScanNet dataset. For downloading the raw data, please refer to the instructions on the official GitHub page. For the pseudo mask generation, we don't need any specific data preprocessing, but we have to extract the ScanNet images from their .sens files. For this, please refer to the ScanNet repository's SensReader, where you can use the Python script to extract every Nth frame from the .sens files. In our experiments we use every 20th frame.
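For reference, a typical extraction call with the SensReader Python script looks roughly like the following; the exact flags depend on the SensReader version you use, and keeping only every 20th frame may require a small modification of the script or a post-filtering step:

python reader.py --filename <SCENE_ID>.sens --output_path <OUTPUT_FOLDER> --export_color_images --export_depth_images --export_poses --export_intrinsics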
Additionally, we preprocess the scenes for instance segmentation annotations, so we don't have to parse the annotations in the pseudo-mask generation and self-training stages. For this we follow the logic for extracting ScanNet200 instance masks from the LGround method.
To preprocess the ScanNet dataset, run
cd pseudo_masks/datasets/preprocess && \
python scannet200_insseg.py --input <YOUR_DOWNLOADED_SCANNET_SCENES_FOLDER> --output <YOUR_PROCESSED_SCANNET_SCENES_FOLDER>
This will also copy the official ScanNet train/val/test split files to the processed folder.
Our module combines self-supervised pretrained features from 2D/3D domains with a geometric oversegmentation of scenes, allowing for efficient representation of scene properties. We use a greedy approach with the Normalized Cut algorithm to iteratively separate foreground-background segments. After each iteration, we mask out predicted segment features and continue this process until there are no segments remaining. This method is demonstrated in the included animation.
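The sketch below illustrates this greedy NCut loop on per-segment features. It is a minimal, self-contained simplification of the idea rather than the repository's actual implementation: it assumes each segment of the oversegmentation already carries an averaged feature vector, and the function names and median-based bipartition rule are our own; the parameter names mirror the options listed further below.

```python
# Simplified sketch of greedy Normalized Cut pseudo mask generation over scene segments.
import numpy as np
import scipy.linalg

def normalized_cut_bipartition(affinity):
    """Bipartition the segment graph via the NCut relaxation:
    second-smallest generalized eigenvector of (D - W) x = lambda * D x."""
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity
    _, eigvecs = scipy.linalg.eigh(laplacian, np.diag(degree))
    fiedler = eigvecs[:, 1]                      # eigenvalues are returned in ascending order
    return fiedler > np.median(fiedler)          # boolean foreground mask over segments

def generate_pseudo_masks(segment_features, affinity_tau=0.65, min_segment_size=4):
    """Greedily peel off foreground segment groups until nothing usable remains."""
    feats = segment_features / np.linalg.norm(segment_features, axis=1, keepdims=True)
    active = np.arange(len(feats))               # segments not yet assigned to any mask
    masks = []
    while len(active) > min_segment_size:
        affinity = feats[active] @ feats[active].T
        affinity = (affinity > affinity_tau).astype(float) + 1e-6   # keep the graph connected
        foreground = normalized_cut_bipartition(affinity)
        if foreground.sum() < min_segment_size:
            break
        masks.append(active[foreground])         # one pseudo instance = a set of segment ids
        active = active[~foreground]             # mask out the predicted segments and repeat
    return masks
```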
To extract pseudo masks using both modalities and DINO features, run
cd pseudo_masks
. scripts/unscene3d_dino_2d3d.sh
The most important parameters are the following (an example override is shown after the list):

- `freemask.modality` - the modality to use for the pseudo mask generation, either `geom`, `color` or `both`
- `freemask.affinity_tau` - the threshold for the Normalized Cut algorithm; for ablation studies check our supplementary material
- `data.segments_min_vert_nums` - the minimum number of vertices in a segment after oversegmentation; for ablation studies check our supplementary material
- `freemask.min_segment_size` - the minimum number of segments in a pseudo mask; for ablation studies check our supplementary material
- `freemask.separation_mode` - the mode of the Normalized Cut algorithm, either `max`, `avg`, `all` or `largest`; for ablation studies check our supplementary material
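For example, a geometry-only run with a stricter affinity threshold could be configured by adjusting the corresponding overrides in the launch script; the values here are illustrative only:

freemask.modality=geom freemask.affinity_tau=0.65 freemask.separation_mode=max data.segments_min_vert_nums=50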
We use the pseudo masks generated in the previous step to train the model in a self-training manner. We use the pseudo masks as ground truth and train the model with a noise-robust loss, which is a combination of the standard cross-entropy loss and the Dice loss, with low-quality matches filtered by a 3D version of DropLoss. First we have to format the data for the self-training cycles. In this part of the code we rely for the vast majority on the Mask3D codebase, with some minor modifications, and also follow their logic for training.
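A minimal sketch of such a loss is shown below, assuming binary mask logits for each matched instance; the function names and the IoU-based drop criterion are our own simplification of the DropLoss idea, not the exact implementation used in the repository.

```python
# Hedged sketch of a noise-robust mask loss: cross-entropy + Dice, with low-quality
# matches to the pseudo masks dropped instead of back-propagating their noise.
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    pred = pred_logits.sigmoid()
    numerator = 2 * (pred * target).sum(dim=-1)
    denominator = pred.sum(dim=-1) + target.sum(dim=-1)
    return 1.0 - (numerator + eps) / (denominator + eps)

def noise_robust_mask_loss(pred_logits, pseudo_masks, drop_iou=0.25):
    """pred_logits, pseudo_masks: (num_matched_instances, num_points)."""
    with torch.no_grad():
        pred_bin = (pred_logits.sigmoid() > 0.5).float()
        inter = (pred_bin * pseudo_masks).sum(dim=-1)
        union = ((pred_bin + pseudo_masks) > 0).float().sum(dim=-1)
        iou = inter / union.clamp(min=1.0)
        keep = iou > drop_iou           # drop low-quality matches (simplified DropLoss)
    if keep.sum() == 0:
        return pred_logits.sum() * 0.0  # nothing reliable to supervise in this batch
    ce = F.binary_cross_entropy_with_logits(
        pred_logits[keep], pseudo_masks[keep], reduction="none").mean(dim=-1)
    return (ce + dice_loss(pred_logits[keep], pseudo_masks[keep])).mean()
```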
To save the datasets for self-training, run
python datasets/preprocessing/freemask_preprocessing.py preprocess
The most important parameters are the following (an example call is shown after the list):

- `--data_dir` - the path to the raw ScanNet dataset
- `--save_dir` - the path to save the processed data
- `--modes` - default is ("train", "validation"), but either can be selected on its own
- `--git_repo` - the path to the ScanNet git repository, needed for the ScanNet label mapping
- `--oracle` - if selected, the original annotation is used for pseudo masks in a class-agnostic manner; needed to create a version for evaluation
- `--freemask_dir` - the path to the pseudo masks generated in the previous step
- `--n_jobs` - makes you wait less if you use more cores :)
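For example, a typical call could look like the following, with the paths being placeholders for your own directories:

python datasets/preprocessing/freemask_preprocessing.py preprocess --data_dir=<YOUR_RAW_SCANNET_FOLDER> --save_dir=<YOUR_PROCESSED_OUTPUT_FOLDER> --git_repo=<YOUR_SCANNET_GIT_REPO> --freemask_dir=<YOUR_PSEUDO_MASK_FOLDER> --n_jobs=8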
Finally, to train the model with the pseudo masks over multiple stages of self-training iterations, run
. scripts/mask3d_DINO_CSC_self_train.sh
We provide the pretrained weights for the CSC model, which is used for self-supervised feature extraction. This was trained on the training scenes of ScanNet, with default parameters.
We preprocessed a set of pseudo datasets in different variations, which can be used for self-training. We provide the following datasets:
Dataset Name | Description
---|---
scannet_freemask_oracle | The oracle dataset using the GT ScanNet mask annotation, mostly used for evaluation only.
unscene3d_dino | Our proposed pseudo mask dataset, using projected 2D features from DINO for the NCut stage.
unscene3d_dino_csc | Our proposed pseudo mask dataset, using both 3D CSC and projected 2D features from DINO for the NCut stage.
unscene3d_arkit | Using 3D features on the ArKitScenes dataset for the NCut stage.
We also provide the trained checkpoints for the self-training iterations, which can be used for evaluation or inference purposes. The checkpoints and the corresponding training logs are available for the following setups:
Setup Name | Description
---|---
unscene3d_CSC_self_train_3 | The model trained with pseudo masks using 3D features only, after 3 self-training iterations.
unscene3d_DINO_self_train_3 | The model trained with pseudo masks using 2D features only, after 3 self-training iterations.
unscene3d_DINO_CSC_self_train_3 | The model trained with pseudo masks using both 2D and 3D features, after 3 self-training iterations.
unscene3d_arkit_self_train_2 | The model trained on ArKitScenes with pseudo masks extracted from 3D features only, after 2 self-training iterations.
Finally, we show some qualitative results of the pseudo mask generation and the self-training iterations for the different setups. You can download the visualizations for 3D-only and combined 2D/3D pseudo masks. For opening the visualizations in your browser, please use PyViz3D.
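For illustration, a pseudo instance visualization could be exported with PyViz3D roughly as follows; the file names and the random per-instance coloring below are our own assumptions, not the format of the released visualizations.

```python
# Minimal PyViz3D sketch: color each pseudo instance and export a browser-viewable scene.
import numpy as np
import pyviz3d.visualizer as viz

points = np.load("scene_points.npy").astype(np.float32)   # (N, 3) xyz, hypothetical file name
instances = np.load("scene_instances.npy")                 # (N,) instance id per point, hypothetical

colors = np.zeros((len(points), 3), dtype=np.uint8)
rng = np.random.default_rng(0)
for inst_id in np.unique(instances):
    colors[instances == inst_id] = rng.integers(0, 255, size=3)  # one random color per instance

v = viz.Visualizer()
v.add_points("pseudo_instances", points, colors, point_size=25)
v.save("unscene3d_visualization")
```

A saved PyViz3D visualization folder, including the ones linked above, can then be viewed by serving it locally (for example with python -m http.server) and opening the printed address in your browser.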