[ICLR 2024 spotlight] InstructScene

InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior

Chenguo Lin, Yadong Mu

This repository contains the official implementation of the paper InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior, which was accepted to ICLR 2024 as a spotlight presentation. InstructScene is a generative framework that synthesizes 3D indoor scenes from instructions. It consists of a semantic graph prior and a layout decoder.

Feel free to contact me (chenguolin@stu.pku.edu.cn) or open an issue if you have any questions or suggestions.

📢 News

2024-02-28: The pretrained weights of fVQ-VAE are released.
2024-02-28: The source code and preprocessed dataset are released.
2024-02-07: The paper is available on arXiv.
2024-01-16: InstructScene is accepted by ICLR 2024 for spotlight presentation.

📋 TODO

Release the training and evaluation code
Release the preprocessed dataset and rendered images on HuggingFace
Release the pretrained weights of fVQ-VAE to quantize OpenShape features of 3D-FRONT objects
Release the dataset preprocessing scripts (ChatGPT API usage, quantization of OpenShape features, etc.)
Add an "FAQ" section to provide detailed explanations on the implementation

🔧 Installation

You may need to modify the torch version in settings/setup.sh to match your CUDA version. There are no restrictions on the torch version, so feel free to use your preferred one.

git clone https://github.com/chenguolin/InstructScene.git
cd InstructScene
bash settings/setup.sh
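
After setup, it can help to confirm which torch build was installed and whether it sees your GPU. The snippet below is a minimal sketch and not part of the repository; the helper name describe_torch_install is hypothetical, and it only relies on the standard torch attributes torch.__version__, torch.version.cuda and torch.cuda.is_available().

```python
import importlib.util


def describe_torch_install():
    """Report the installed torch build, or note that torch is missing.

    Hypothetical helper for illustration; it is not shipped with InstructScene.
    """
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch  # imported lazily so the check works even without torch

    return {
        "installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

If built_for_cuda does not match your local CUDA toolkit, adjusting the torch version in settings/setup.sh is the likely fix.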

Download the Blender software for visualization.

cd blender
wget https://download.blender.org/release/Blender3.3/blender-3.3.1-linux-x64.tar.xz
tar -xvf blender-3.3.1-linux-x64.tar.xz
rm blender-3.3.1-linux-x64.tar.xz

📊 Dataset

The dataset used in InstructScene is based on 3D-FRONT and 3D-FUTURE. Please refer to the instructions on their official websites to download the original datasets. You can also refer to the dataset preprocessing scripts in ATISS and DiffuScene, which are similar to ours.

We provide the preprocessed instruction-scene paired dataset used in the paper and rendered images for evaluation on HuggingFace.

import os
from huggingface_hub import hf_hub_url
url = hf_hub_url(repo_id="chenguolin/InstructScene_dataset", filename="InstructScene.zip", repo_type="dataset")
os.system(f"wget {url} && unzip InstructScene.zip")
url = hf_hub_url(repo_id="chenguolin/InstructScene_dataset", filename="3D-FRONT.zip", repo_type="dataset")
os.system(f"wget {url} && unzip 3D-FRONT.zip")

Please refer to dataset/README.md for more details.

👀 Visualization

We provide a helper script to visualize synthesized scenes with Blender. Please refer to blender/README.md for more details.

We also provide many useful visualization functions in src/utils/visualize.py, including creating appropriate floor plans, drawing scene graphs, adding instructions as titles in the rendered images, making gifs, etc.

🚀 Usage

Note that:

All scripts in this project run on a single GPU. Training the semantic graph prior or the layout decoder takes 1 to 3 days on a single NVIDIA A40 GPU, depending on the room type.
We use TensorBoard to track training; run tensorboard --logdir out/ to view the logs.
The "1. layout decoder" and the "2. semantic graph prior" are trained independently and can be trained in parallel, because the layout decoder is trained on ground-truth semantic graphs. During inference, rendering synthesized scenes from instruction prompts requires both the trained semantic graph prior and the trained layout decoder.
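
Because the two training runs are independent, they can be started side by side on two GPUs. The sketch below is one possible way to do that, assuming the script signatures documented in the sections that follow and the tag naming used in this README's examples; build_training_commands and launch_in_parallel are hypothetical helpers, not part of the repository.

```python
import subprocess


def build_training_commands(room_type, fvqvae_tag, decoder_gpu, prior_gpu):
    """Build the two training commands as argument lists.

    The tag names follow the examples in this README
    (e.g. bedroom_sg2scdiffusion_objfeat); any other tag works as well.
    """
    decoder_tag = f"{room_type}_sg2scdiffusion_objfeat"
    prior_tag = f"{room_type}_sgdiffusion_vq_objfeat"
    return [
        # Layout decoder: <room_type> <tag> <gpu_id> <fvqvae_tag>
        ["bash", "scripts/train_sg2sc_objfeat.sh",
         room_type, decoder_tag, str(decoder_gpu), fvqvae_tag],
        # Semantic graph prior: <room_type> <tag> <gpu_id>
        ["bash", "scripts/train_sg_vq_objfeat.sh",
         room_type, prior_tag, str(prior_gpu)],
    ]


def launch_in_parallel(commands):
    """Start every command at once and wait for all of them to finish."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


# Example (run from the repository root, with GPUs 0 and 1 free):
# exit_codes = launch_in_parallel(
#     build_training_commands("bedroom", "threedfront_objfeat_vqvae", 0, 1))
```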

0️. 📦 fVQ-VAE: quantize OpenShape/CLIP features of objects

Training

We provide the pretrained weights of fVQ-VAE on HuggingFace. Our preprocessed dataset contains the original OpenShape features and the corresponding quantization indices.

import os
from huggingface_hub import hf_hub_url
os.system("mkdir -p out/threedfront_objfeat_vqvae/checkpoints")
url = hf_hub_url(repo_id="chenguolin/InstructScene_dataset", filename="threedfront_objfeat_vqvae_epoch_01999.pth", repo_type="dataset")
os.system(f"wget {url} -O out/threedfront_objfeat_vqvae/checkpoints/epoch_01999.pth")

You can also train the fVQ-VAE from scratch. If you do, you must also update the quantization indices in the dataset (stored in dataset/InstructScene/threed_front_<room_type>/<scene_id>/models_info.pkl) accordingly.
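
The internal layout of models_info.pkl is not described here, so the following sketch only locates the files that would need updating after retraining. It assumes the directory pattern given above; find_models_info is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_models_info(dataset_root, room_type):
    """Return every <scene_id>/models_info.pkl under threed_front_<room_type>.

    Only the path pattern quoted in this README is assumed; the contents
    of each pickle are left untouched.
    """
    room_dir = Path(dataset_root) / f"threed_front_{room_type}"
    return sorted(room_dir.glob("*/models_info.pkl"))


# Example:
# for path in find_models_info("dataset/InstructScene", "bedroom"):
#     print(path)
```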

# bash scripts/train_objfeatvqvae.sh <tag> <gpu_id>
bash scripts/train_objfeatvqvae.sh threedfront_objfeat_vqvae 0

Inference (only for debugging)

# bash scripts/inference_objfeatvqvae.sh <tag> <gpu_id> <epoch>
bash scripts/inference_objfeatvqvae.sh threedfront_objfeat_vqvae 0 -1
# '-1' means the latest checkpoint
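
To make the '-1' convention concrete, here is a minimal sketch of how "latest checkpoint" could be resolved from a checkpoints directory. It assumes files named like epoch_01999.pth, as in the pretrained weights above; the repository's own loading code may resolve checkpoints differently, and resolve_checkpoint is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches names such as epoch_01999.pth and captures the epoch number.
_EPOCH_PATTERN = re.compile(r"^epoch_(\d+)\.pth$")


def resolve_checkpoint(checkpoint_dir, epoch):
    """Return the checkpoint path for `epoch`, or the newest one if epoch < 0."""
    checkpoint_dir = Path(checkpoint_dir)
    if epoch >= 0:
        # Assumption: epochs are zero-padded to five digits.
        return checkpoint_dir / f"epoch_{epoch:05d}.pth"

    candidates = []
    for path in checkpoint_dir.glob("epoch_*.pth"):
        match = _EPOCH_PATTERN.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        raise FileNotFoundError(f"no epoch_*.pth files in {checkpoint_dir}")
    # The highest epoch number is treated as the latest checkpoint.
    return max(candidates, key=lambda item: item[0])[1]
```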

1️. 🦾 Layout Decoder: embody 3D scenes from semantic graphs

Training

# bash scripts/train_sg2sc_objfeat.sh <room_type> <tag> <gpu_id> <fvqvae_tag>
bash scripts/train_sg2sc_objfeat.sh bedroom bedroom_sg2scdiffusion_objfeat 0 threedfront_objfeat_vqvae

Inference (only for debugging)

# bash scripts/inference_sg2sc_objfeat.sh <room_type> <tag> <gpu_id> <epoch> <fvqvae_tag> <(optional) cfg_scale>
bash scripts/inference_sg2sc_objfeat.sh bedroom bedroom_sg2scdiffusion_objfeat 0 -1 threedfront_objfeat_vqvae 1.0

To visualize synthesized scenes, replace --n_scene 0 in scripts/inference_sg2sc_objfeat.sh with --n_scenes 5 --visualize --resolution 1024, which renders 5 synthesized scenes and saves the images at a resolution of 1024x1024. Otherwise, the script only computes the iRecall score for evaluation.
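
If you prefer not to edit the script by hand, the substitution can be scripted. This is a minimal sketch that operates on the script's text and assumes it contains the literal string --n_scene 0 quoted above; enable_visualization is a hypothetical helper, and the same function would apply to the other inference scripts.

```python
def enable_visualization(script_text, n_scenes=5, resolution=1024):
    """Swap the evaluation-only flag for the visualization flags.

    Raises ValueError if the expected flag is absent, so that a script
    with different contents is never silently left unchanged.
    """
    old_flags = "--n_scene 0"
    new_flags = f"--n_scenes {n_scenes} --visualize --resolution {resolution}"
    if old_flags not in script_text:
        raise ValueError(f"'{old_flags}' not found in script text")
    return script_text.replace(old_flags, new_flags)


# Example (run from the repository root):
# from pathlib import Path
# script = Path("scripts/inference_sg2sc_objfeat.sh")
# script.write_text(enable_visualization(script.read_text()))
```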

2️. 🤖 Semantic Graph Prior: design semantic graphs from instructions

Training

# bash scripts/train_sg_vq_objfeat.sh <room_type> <tag> <gpu_id>
bash scripts/train_sg_vq_objfeat.sh bedroom bedroom_sgdiffusion_vq_objfeat 0

Inference

# bash scripts/inference_sg_vq_objfeat.sh <room_type> <tag> <gpu_id> <epoch> <fvqvae_tag> <sg2sc_tag> <(optional) cfg_scale> <(optional) sg2sc_cfg_scale>
bash scripts/inference_sg_vq_objfeat.sh bedroom bedroom_sgdiffusion_vq_objfeat 0 -1 threedfront_objfeat_vqvae bedroom_sg2scdiffusion_objfeat 1.0 1.0

To visualize synthesized scenes, replace --n_scene 0 in scripts/inference_sg_vq_objfeat.sh with --n_scenes 5 --visualize --resolution 1024, which renders 5 synthesized scenes and saves the images at a resolution of 1024x1024. Otherwise, the script only computes the iRecall score for evaluation.

Evaluation

Evaluation should be conducted after the inference script is executed with the --visualize flag, which will save the rendered images in the output directory.

FID, CLIP-FID and KID

python3 src/compute_fid_scores.py configs/bedroom_sgdiffusion_vq_objfeat.yaml --tag bedroom_sgdiffusion_vq_objfeat --checkpoint_epoch -1

SCA (scene classification accuracy)

python3 src/synthetic_vs_real_classifier.py configs/bedroom_sgdiffusion_vq_objfeat.yaml --tag bedroom_sgdiffusion_vq_objfeat --checkpoint_epoch -1
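
Both evaluation commands share the same arguments, so they can be generated from a single tag. The sketch below assumes the config file is named after the tag (configs/<tag>.yaml), as in the bedroom examples above; build_eval_commands is a hypothetical helper, not part of the repository.

```python
def build_eval_commands(tag, checkpoint_epoch=-1):
    """Build the FID/CLIP-FID/KID and SCA commands for one experiment tag."""
    shared_args = [
        f"configs/{tag}.yaml",  # assumption: config file shares the tag name
        "--tag", tag,
        "--checkpoint_epoch", str(checkpoint_epoch),
    ]
    return {
        "fid_kid": ["python3", "src/compute_fid_scores.py", *shared_args],
        "sca": ["python3", "src/synthetic_vs_real_classifier.py", *shared_args],
    }


# Example (run from the repository root, after inference with --visualize):
# import subprocess
# for command in build_eval_commands("bedroom_sgdiffusion_vq_objfeat").values():
#     subprocess.run(command, check=True)
```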

Applications

For the "stylization", "rearrangement" or "completion" downstream tasks, replace the Python file name generate_sg.py in scripts/inference_sg_vq_objfeat.sh with stylize_sg.py, rearrange_sg.py or complete_sg.py, respectively.

Please refer to these Python files for more detailed arguments and usage.
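
The file-name swap can also be scripted. This minimal sketch works on the script's text and assumes it references generate_sg.py exactly as written above; TASK_TO_SCRIPT and switch_task are hypothetical names used for illustration only.

```python
# Downstream task name -> Python entry point, as listed in this README.
TASK_TO_SCRIPT = {
    "stylization": "stylize_sg.py",
    "rearrangement": "rearrange_sg.py",
    "completion": "complete_sg.py",
}


def switch_task(script_text, task):
    """Point the inference script at the entry point for `task`."""
    if task not in TASK_TO_SCRIPT:
        raise KeyError(f"unknown task '{task}', expected one of {sorted(TASK_TO_SCRIPT)}")
    if "generate_sg.py" not in script_text:
        raise ValueError("generate_sg.py not found in script text")
    return script_text.replace("generate_sg.py", TASK_TO_SCRIPT[task])


# Example (run from the repository root; writes a separate copy):
# from pathlib import Path
# source = Path("scripts/inference_sg_vq_objfeat.sh").read_text()
# Path("scripts/inference_stylize.sh").write_text(switch_task(source, "stylization"))
```

Writing the result to a separate file keeps the original generation script available for later runs.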

😊 Acknowledgement

We would like to thank the authors of ATISS, DiffuScene, OpenShape, NAP and CLIPLayout for their great work and for generously providing source code, which inspired our work and helped us a lot in the implementation.

📚 Citation

If you find our work helpful, please consider citing:

@inproceedings{lin2024instructscene,
  title={InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior},
  author={Chenguo Lin and Yadong Mu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
