CommonScenes

This is the official implementation of the paper CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion. We propose a diffusion-based method that generates an entire 3D scene from a scene graph, producing its layout and 3D geometries holistically.

Website | Arxiv

Guangyao Zhai *, Evin Pınar Örnek *, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. (*Equal contribution.)
NeurIPS 2023

Setup

Environment

Download the code and go to the folder.

git clone https://github.com/ymxlzgy/commonscenes
cd commonscenes

We have tested the code on Ubuntu 20.04 with Python 3.8, PyTorch 1.11.0, CUDA 11.3, and PyTorch3D.

conda create -n commonscenes python=3.8
conda activate commonscenes
pip install -r requirements.txt 
pip install einops omegaconf tensorboardx open3d

To install CLIP, follow this OpenAI CLIP repo:

pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

Set up the additional Chamfer Distance calculation for evaluation:

cd ./extension
python setup.py install
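After the steps above, it can help to confirm which packages actually landed in the environment. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and the distribution names `torch` and `pytorch3d` are assumptions (the other names are taken from the install commands above).

```python
from importlib import metadata

# Names from the pip commands above; "torch" and "pytorch3d" are assumed
# distribution names and are not confirmed by this README.
PACKAGES = ["torch", "pytorch3d", "einops", "omegaconf", "tensorboardx",
            "open3d", "ftfy", "regex", "tqdm"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Comparing the printed PyTorch version against the tested configuration (PyTorch 1.11.0 with CUDA 11.3) is a quick first check when the Chamfer Distance extension fails to build.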

Dataset

Download the 3D-FRONT dataset from their official site.
Preprocess the dataset following ATISS.
Download 3D-FUTURE-SDF. We generated it from the 3D-FUTURE meshes using the tools in SDFusion.
Follow this page to download SG-FRONT and find more information.
Create a folder named FRONT and copy all files into it.

The structure should look similar to this:

FRONT
|--3D-FRONT
|--3D-FRONT_preprocessed (by ATISS)
|--threed_front.pkl (by ATISS)
|--3D-FRONT-texture
|--3D-FUTURE-model
|--3D-FUTURE-scene
|--3D-FUTURE-SDF
|--All SG-FRONT files (.json and .txt)
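The layout above can be checked before training with a short script. This is a sketch under assumptions: `check_front_layout` is a hypothetical helper that is not part of the repository, and it assumes the SG-FRONT `.json` and `.txt` files sit directly inside FRONT, as the tree suggests.

```python
from pathlib import Path

# Entry names copied from the tree above.
EXPECTED_ENTRIES = [
    "3D-FRONT",
    "3D-FRONT_preprocessed",
    "threed_front.pkl",
    "3D-FRONT-texture",
    "3D-FUTURE-model",
    "3D-FUTURE-scene",
    "3D-FUTURE-SDF",
]


def check_front_layout(root):
    """Return (missing entries, whether any .json/.txt file sits directly in root)."""
    root = Path(root)
    missing = [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
    # Assumption: SG-FRONT files are copied directly into the FRONT folder.
    has_sg_front = any(
        p.is_file() and p.suffix in {".json", ".txt"} for p in root.glob("*")
    )
    return missing, has_sg_front


if __name__ == "__main__":
    missing, has_sg_front = check_front_layout("/path/to/FRONT")
    print("missing entries:", missing or "none")
    print("SG-FRONT files found:", has_sg_front)
```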

Models

Essential: Download the pretrained VQ-VAE model from here to the folder scripts/checkpoint.

Optional: We provide two trained CommonScenes models, available here.

Training

To train the models, run:

cd scripts
python train_3dfront.py --exp /media/ymxlzgy/Data/graphto3d_models/balancing/all --room_type livingroom --dataset /path/to/FRONT --residual True --network_type v2_full --with_SDF True --with_CLIP True --batchSize 4 --workers 4 --loadmodel False --nepoch 10000 --large False

--room_type: the room type to train on, e.g., livingroom, diningroom, bedroom, or all. We train all rooms together in the implementation.

--network_type: the network to be trained. v1_box is Graph-to-Box, v1_full is Graph-to-3D (DeepSDF version), v2_box is the layout branch of CommonScenes, and v2_full is CommonScenes. (Note: to train v1_full, the additional meshes and codes reconstructed by DeepSDF should also be downloaded from here and copied to FRONT.)

--with_SDF: set to True when training v2_full.

--with_CLIP: set to True when training v2_box or v2_full; it is not used otherwise.

--batchSize: the batch size for training the layout branch. (Note: the batch size for the shape branch is set in v2_full.yaml and v2_full_concat.yaml. The meaning of each batch size is explained in Supplementary Material G.1.)

--large: False by default; True uses more fine-grained object categories.

We provide three examples here: Graph-to-3D (DeepSDF version), Graph-to-Box, and CommonScenes. The recommended GPU for CommonScenes is a single A100, though a 3090 can also train the network with a lower batch size on the shape branch.
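The flag rules above can be captured in a small launcher so that --with_SDF and --with_CLIP always agree with --network_type. This is only a sketch: `build_train_command` is a hypothetical helper, the flag names are copied from the example command, and setting --with_SDF and --with_CLIP to False for the remaining network types is an assumption. The values of --residual and --loadmodel are kept from the example command; whether they suit every network type is not stated here.

```python
import shlex

NETWORK_TYPES = {"v1_box", "v1_full", "v2_box", "v2_full"}


def build_train_command(exp, dataset, network_type="v2_full", room_type="all",
                        batch_size=4, workers=4, nepoch=10000, large=False):
    """Return the argument list for train_3dfront.py (hypothetical helper)."""
    if network_type not in NETWORK_TYPES:
        raise ValueError(f"unknown network_type: {network_type!r}")
    # Rules from the flag descriptions; False in all other cases is an assumption.
    with_sdf = network_type == "v2_full"
    with_clip = network_type in {"v2_box", "v2_full"}
    return [
        "python", "train_3dfront.py",
        "--exp", str(exp),
        "--room_type", room_type,
        "--dataset", str(dataset),
        "--residual", "True",      # value kept from the example command
        "--network_type", network_type,
        "--with_SDF", str(with_sdf),
        "--with_CLIP", str(with_clip),
        "--batchSize", str(batch_size),
        "--workers", str(workers),
        "--loadmodel", "False",    # value kept from the example command
        "--nepoch", str(nepoch),
        "--large", str(large),
    ]


if __name__ == "__main__":
    # /path/to/experiment is a placeholder output directory.
    cmd = build_train_command("/path/to/experiment", "/path/to/FRONT")
    print(shlex.join(cmd))
```

The printed command is meant to be run from the scripts folder; passing the list to `subprocess.run(cmd, cwd="scripts")` would be one way to launch it from Python.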

Evaluation

To evaluate the models, run:

cd scripts
python eval_3dfront.py --exp /media/ymxlzgy/Data/graphto3d_models/balancing/all --epoch 180 --visualize False --evaluate_diversity False --num_samples 5 --gen_shape False --no_stool True

--exp: the directory where the models are stored.

--gen_shape: set to True to enable the diffusion-based shape branch.

--evaluate_diversity: set to True to compute diversity. This takes a while, so it is disabled by default.

--num_samples: the number of experiment rounds when evaluating diversity.

FID/KID

This metric evaluates scene-level fidelity. To evaluate FID/KID, first download the top-down ground-truth rendered images for retrieval methods and the SDF rendered images for generative methods, or collect the renderings by modifying and running collect_gt_sdf_images.py. Note that the flag without_lamp is set to True in our experiments.

Make sure you have downloaded all the files and preprocessed 3D-FRONT. The renderings of generated scenes can be obtained inside eval_3dfront.py.

After obtaining both the ground-truth images and the renderings of the generated scenes, run compute_fid_scores_3dfront.py.
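For readers unfamiliar with KID, the quantity can be illustrated on plain feature vectors. The sketch below is not the code in compute_fid_scores_3dfront.py; it is a pure-Python illustration of the standard definition (an unbiased squared-MMD estimate with the cubic polynomial kernel), and the function names are hypothetical. In practice the feature vectors come from an Inception network applied to the rendered images, and the estimate is usually averaged over several random subsets.

```python
def polynomial_kernel(x, y):
    """KID kernel: (x . y / d + 1) ** 3, where d is the feature dimension."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3


def kid_unbiased(real_feats, gen_feats):
    """Unbiased squared-MMD estimate between two lists of feature vectors."""
    m, n = len(real_feats), len(gen_feats)
    if m < 2 or n < 2:
        raise ValueError("need at least two samples per set")
    # Within-set terms exclude the diagonal (i == j) to keep the estimate unbiased.
    k_rr = sum(polynomial_kernel(real_feats[i], real_feats[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_gg = sum(polynomial_kernel(gen_feats[i], gen_feats[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    # Cross-set term uses every real/generated pair.
    k_rg = sum(polynomial_kernel(r, g)
               for r in real_feats for g in gen_feats) / (m * n)
    return k_rr + k_gg - 2.0 * k_rg


if __name__ == "__main__":
    real = [[0.0, 0.0], [0.0, 0.0]]
    generated = [[2.0, 0.0], [2.0, 0.0]]
    print(kid_unbiased(real, generated))  # 26.0
```

Because the estimator is unbiased, it can return small negative values when the two sets are very similar.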

MMD/COV/1-NN

This metric evaluates object-level fidelity. Please follow the implementation in PointFlow. To evaluate it, you need to store the objects in the generated scenes one by one, which can be done in eval_3dfront.py.

After obtaining the object meshes, run compute_mmd_cov_1nn.py to obtain the results.
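To make the three numbers concrete, here is a pure-Python sketch of how MMD, COV, and 1-NN accuracy are commonly defined over two sets of point clouds. It is not the code in compute_mmd_cov_1nn.py, the function names are hypothetical, and the Chamfer distance shown (mean squared nearest-neighbor distance, summed over both directions) is one common variant that may differ from the one used by the repository's extension. The quadratic loops are for illustration only and are far too slow for real evaluation.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point lists (squared Euclidean)."""
    def sq(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    a_to_b = sum(min(sq(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def mmd_cov_1nna(generated, reference, dist=chamfer_distance):
    """Return (MMD, COV, 1-NN accuracy) for two lists of point clouds."""
    n_gen, n_ref = len(generated), len(reference)
    # d[i][j]: distance from generated shape i to reference shape j.
    d = [[dist(g, r) for r in reference] for g in generated]
    # MMD: average distance from each reference shape to its closest generated shape.
    mmd = sum(min(d[i][j] for i in range(n_gen)) for j in range(n_ref)) / n_ref
    # COV: fraction of reference shapes that are the nearest neighbor of some generated shape.
    matched = {min(range(n_ref), key=lambda j: d[i][j]) for i in range(n_gen)}
    cov = len(matched) / n_ref
    # 1-NN accuracy: leave-one-out nearest-neighbor classification of set membership.
    samples = [(g, 0) for g in generated] + [(r, 1) for r in reference]
    correct = 0
    for i, (shape, label) in enumerate(samples):
        nearest = min((k for k in range(len(samples)) if k != i),
                      key=lambda k: dist(shape, samples[k][0]))
        correct += samples[nearest][1] == label
    return mmd, cov, correct / len(samples)
```

Lower MMD and higher COV are better, while a 1-NN accuracy close to 0.5 indicates that generated and reference shapes are hard to tell apart.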

Acknowledgements

If you find this work useful in your research, please cite:

@inproceedings{
  zhai2023commonscenes,
  title={CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion},
  author={Zhai, Guangyao and {\"O}rnek, Evin P{\i}nar and Wu, Shun-Cheng and Di, Yan and Tombari, Federico and Navab, Nassir and Busam, Benjamin},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=1SF2tiopYJ}
}

This repository is based on Graph-to-3D and SDFusion. We thank the authors for making their code available.

Disclaimer

Tired students finished the pipeline in busy days...