Jinheng Xie1 Yuexiang Li2 Yawen Huang2 Haozhe Liu2,3 Wentian Zhang2 Yefeng Zheng2 Mike Zheng Shou1
1 National University of Singapore 2 Tencent Jarvis Lab 3 KAUST
Note that we have only tested the code with PyTorch==1.12.0. You can set up the environment via pip
as follows:
pip3 install -r requirements.txt
To apply BoxDiff to the GLIGEN pipeline, please install diffusers as follows:
git clone git@github.com:gligen/diffusers.git
cd diffusers
pip3 install -e .
To add spatial control to the Stable Diffusion model, you can simply use run_sd_boxdiff.py. For example:
CUDA_VISIBLE_DEVICES=0 python3 run_sd_boxdiff.py --prompt "as the aurora lights up the sky, a herd of reindeer leisurely wanders on the grassy meadow, admiring the breathtaking view, a serene lake quietly reflects the magnificent display, and in the distance, a snow-capped mountain stands majestically, fantasy, 8k, highly detailed" --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30] --token_indices [3,12,21,30,46] --bbox [[1,3,512,202],[75,344,421,495],[1,327,508,507],[2,217,507,341],[1,135,509,242]] --refine False
Another example:
CUDA_VISIBLE_DEVICES=0 python3 run_sd_boxdiff.py --prompt "A rabbit wearing sunglasses looks very proud" --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9] --token_indices [2,4] --bbox [[67,87,366,512],[66,130,364,262]]
Note that the token indices are the indices of the words in the text prompt that you want to control, and each token index has exactly one corresponding conditioning box. In the example above, indices 2 and 4 select "rabbit" and "sunglasses". P and L are hyper-parameters of the proposed constraints.
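As a rough aid for choosing --token_indices, the sketch below looks up word positions in a prompt. It assumes whitespace splitting and that index 0 is reserved, so the first word is index 1; this matches the rabbit example above. `word_positions` is a hypothetical helper, not part of the repository. The tokenizer used by the pipeline may treat punctuation or rare words as separate tokens, so for long prompts with commas the result should be checked against the real tokenizer.

```python
from typing import List


def word_positions(prompt: str, words: List[str]) -> List[int]:
    """Return 1-based positions of `words` in a whitespace-split prompt.

    Assumption: index 0 is reserved, so the first word has index 1.
    Raises ValueError if a word does not occur in the prompt.
    """
    tokens = prompt.lower().split()
    # list.index returns the first occurrence; add 1 for the reserved index 0.
    return [tokens.index(word.lower()) + 1 for word in words]


prompt = "A rabbit wearing sunglasses looks very proud"
print(word_positions(prompt, ["rabbit", "sunglasses"]))  # prints [2, 4]
```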
When --bbox is not specified, an interface is provided for drawing the bounding boxes used as conditions:
CUDA_VISIBLE_DEVICES=0 python3 run_sd_boxdiff.py --prompt "A rabbit wearing sunglasses looks very proud" --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9] --token_indices [2,4]
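The invocations above can also be assembled programmatically, which makes it easy to check that every token index has one box before launching a run. The following is a minimal sketch: `build_args` and `compact` are hypothetical helpers, only flags that appear in the examples above are used, and the box layout [x_min, y_min, x_max, y_max] is an assumption inferred from those examples.

```python
import json
import shlex
from typing import List, Optional


def compact(value) -> str:
    """Serialize a list without spaces, e.g. [[1,2],[3,4]], as in the examples."""
    return json.dumps(value, separators=(",", ":"))


def build_args(prompt: str, token_indices: List[int], seeds: List[int],
               bboxes: Optional[List[List[int]]] = None,
               P: float = 0.2, L: int = 1) -> List[str]:
    """Build the argument list for run_sd_boxdiff.py.

    If `bboxes` is None, --bbox is omitted, so boxes are drawn interactively.
    """
    if bboxes is not None:
        if len(bboxes) != len(token_indices):
            raise ValueError("each token index needs exactly one box")
        for x0, y0, x1, y1 in bboxes:
            # Assumed layout: [x_min, y_min, x_max, y_max].
            if not (x0 < x1 and y0 < y1):
                raise ValueError(f"degenerate box: {[x0, y0, x1, y1]}")
    args = ["python3", "run_sd_boxdiff.py",
            "--prompt", prompt,
            "--P", str(P), "--L", str(L),
            "--seeds", compact(seeds),
            "--token_indices", compact(token_indices)]
    if bboxes is not None:
        args += ["--bbox", compact(bboxes)]
    return args


args = build_args("A rabbit wearing sunglasses looks very proud",
                  token_indices=[2, 4], seeds=[1, 2, 3],
                  bboxes=[[67, 87, 366, 512], [66, 130, 364, 262]])
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(a) for a in args))
```

The returned list can be passed directly to `subprocess.run` if launching from Python is preferred over copying the printed command.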
To add spatial control to the GLIGEN model, you can simply use run_gligen_boxdiff.py. For example:
CUDA_VISIBLE_DEVICES=0 python3 run_gligen_boxdiff.py --prompt "A rabbit wearing sunglasses looks very proud" --gligen_phrases ["a rabbit","sunglasses"] --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9] --token_indices [2,4] --bbox [[67,87,366,512],[66,130,364,262]] --refine False
The directory structure of the synthesized results is as follows:
outputs/
|-- text prompt/
| |-- 0.png
| |-- 0_canvas.png
| |-- 1.png
| |-- 1_canvas.png
| |-- ...
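To browse the results, each N.png can be matched with its N_canvas.png. The sketch below is one way to do that, assuming the layout shown above holds exactly; `pair_results` is a hypothetical helper and not part of the repository.

```python
import re
from pathlib import Path
from typing import List, Tuple


def pair_results(prompt_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (image, canvas) pairs from one prompt folder, sorted by number."""
    pairs = []
    for image in prompt_dir.glob("*.png"):
        # Keep only purely numeric names such as 0.png; this skips *_canvas.png.
        if not re.fullmatch(r"\d+", image.stem):
            continue
        canvas = image.with_name(f"{image.stem}_canvas.png")
        if canvas.exists():
            pairs.append((image, canvas))
    # Sort numerically so that 10.png comes after 2.png.
    return sorted(pairs, key=lambda pair: int(pair[0].stem))


# Example usage (the folder name is assumed to be the prompt text):
# for image, canvas in pair_results(Path("outputs") / "text prompt"):
#     print(image.name, canvas.name)
```

Images without a matching canvas file are skipped rather than reported, which keeps the sketch short; a stricter version could raise an error instead.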
VisorGPT can customize layouts as spatial conditions for image synthesis using BoxDiff.
@InProceedings{Xie_2023_ICCV,
author = {Xie, Jinheng and Li, Yuexiang and Huang, Yawen and Liu, Haozhe and Zhang, Wentian and Zheng, Yefeng and Shou, Mike Zheng},
title = {BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2023},
pages = {7452-7461}
}
Acknowledgment: the code is heavily based on the repositories of diffusers, google, and yuval-alaluf.