Training-Free Layout Control with Cross-Attention Guidance

Minghao Chen, Iro Laina, Andrea Vedaldi

[Paper] [Project Page] [Demo]




teaser

Our method controls the layout of images generated by large pretrained text-to-image diffusion models, without any training, through layout guidance applied to the cross-attention maps.

Abstract

Recent diffusion-based generators can produce high-quality images based only on textual prompts. However, they do not correctly interpret instructions that specify the spatial layout of the composition. We propose a simple approach that can achieve robust layout control without requiring training or fine-tuning the image generator. Our technique, which we call layout guidance, manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the reconstruction in the desired direction given, e.g., a user-specified layout. In order to determine how to best guide attention, we study the role of different attention maps when generating images and experiment with two alternative strategies, forward and backward guidance. We evaluate our method quantitatively and qualitatively with several experiments, validating its effectiveness. We further demonstrate its versatility by extending layout guidance to the task of editing the layout and context of a given real image.
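Conceptually, backward guidance treats the user layout as an energy defined on the cross-attention map of a chosen text token and nudges the noisy latent with the gradient of that energy at each denoising step. The snippet below is a minimal, self-contained PyTorch sketch of this idea; the attention stand-in, the energy function, and the guidance scale are illustrative assumptions, not the repository's exact implementation.

import torch

def layout_energy(attn, box):
    # Encourage the token's attention mass to fall inside the target box.
    # attn: (H, W) cross-attention map for one text token, non-negative.
    # box:  (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    H, W = attn.shape
    x0, y0, x1, y1 = box
    mask = torch.zeros_like(attn)
    mask[int(y0 * H):int(y1 * H), int(x0 * W):int(x1 * W)] = 1.0
    inside = (attn * mask).sum()
    total = attn.sum() + 1e-8
    return (1.0 - inside / total) ** 2  # zero when all attention lies inside the box

# Dummy latent and a stand-in for the U-Net's cross-attention (illustration only).
latent = torch.randn(1, 4, 64, 64, requires_grad=True)

def fake_cross_attention(z):
    # A real model would return the attention between image queries and a text
    # token's key; here we just derive a (16, 16) map from the latent.
    return torch.softmax(z.mean(1)[:, ::4, ::4].reshape(1, -1), dim=-1).reshape(16, 16)

box = (0.1, 0.5, 0.5, 0.9)   # user-specified box for one token
scale = 30.0                 # guidance strength (assumed value)

for _ in range(5):           # a few guidance steps, as would happen per denoising step
    attn = fake_cross_attention(latent)
    loss = layout_energy(attn, box)
    (grad,) = torch.autograd.grad(loss, latent)
    latent = (latent - scale * grad).detach().requires_grad_(True)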

Environment Setup

To set up the environment, simply run the following commands:

conda create -n layout-guidance python=3.8
conda activate layout-guidance
pip install -r requirements.txt

Inference

We provide an example inference script. The example outputs, including the log file, generated images, and config file, are saved to the specified path ./example_output. Detailed configuration can be found in ./conf/base_config.yaml and inference.py.

python inference.py general.save_path=./example_output 
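The abstract mentions a second strategy, forward guidance, in which the cross-attention map itself is biased toward the target box rather than optimized through a gradient. For intuition only, here is a hedged sketch of that variant; the blending rule and the strength value are assumptions for illustration, not the script's actual behavior.

import torch

def forward_guide(attn, box, strength=0.8):
    # Bias a token's (H, W) attention map toward a normalized (x0, y0, x1, y1) box.
    H, W = attn.shape
    x0, y0, x1, y1 = box
    prior = torch.zeros_like(attn)
    prior[int(y0 * H):int(y1 * H), int(x0 * W):int(x1 * W)] = 1.0
    prior = prior / (prior.sum() + 1e-8)                  # uniform distribution over the box
    guided = (1.0 - strength) * attn + strength * prior   # blend the original map with the prior
    return guided / (guided.sum() + 1e-8)                 # renormalize to a distribution

attn = torch.softmax(torch.randn(16, 16).reshape(-1), dim=0).reshape(16, 16)
guided = forward_guide(attn, box=(0.1, 0.5, 0.5, 0.9))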

Applications

Real Image Editing

We achieve real image editing based on DreamBooth and Textual Inversion. Specifically, we can change the context, location, and size of the objects in the original image.

Real image editing with layout guidance involves three steps. Please check the config file ./conf/real_image_editing.yaml for more detailed configuration.

Step 1: Use textual inversion to learn a special token that describes the desired object.

python text_inversion.py \
    general.save_path=./example_output/real_image_editing \
    text_inversion.image_path=./example_input/text_inversion/cat/ \
    text_inversion.initial_token='pet'
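At its core, textual inversion adds a placeholder token to the tokenizer, initializes its embedding from an existing word (here 'pet'), and then optimizes only that embedding row with the usual denoising loss on the user's images. The sketch below illustrates the initialization step using Hugging Face transformers; the model name and placeholder string are assumptions, not the values used by text_inversion.py.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

placeholder, initial = "<my-cat>", "pet"   # hypothetical placeholder token
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))

init_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(initial))[0]
new_id = tokenizer.convert_tokens_to_ids(placeholder)

emb = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    emb[new_id] = emb[init_id].clone()   # start the new token from the initial word

# Training then freezes everything except emb[new_id]; the learned row is what a
# file such as learned_embeds_iteration_500.bin would contain.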

Step 2: Use DreamBooth to fine-tune the U-Net and text encoder.

python dreambooth.py dreambooth.text_inversion_path=./example_output/real_image_editing/text_inversion/learned_embeds_iteration_500.bin
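Passing text_inversion_path lets the fine-tuning stage start from the learned token. A hedged sketch of how such an embedding file could be injected back into the text encoder is shown below; the assumed file layout ({placeholder_token: embedding_tensor}) follows common textual-inversion scripts and may differ from what text_inversion.py actually writes.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

learned = torch.load(
    "./example_output/real_image_editing/text_inversion/learned_embeds_iteration_500.bin",
    map_location="cpu",
)
for token, embedding in learned.items():
    tokenizer.add_tokens(token)                              # register the learned token
    text_encoder.resize_token_embeddings(len(tokenizer))
    token_id = tokenizer.convert_tokens_to_ids(token)
    with torch.no_grad():
        text_encoder.get_input_embeddings().weight[token_id] = embedding
# DreamBooth fine-tuning then updates the U-Net and text encoder around this token.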

Step 3: Perform layout guidance on the fine-tuned text encoder and U-Net.

python inference.py \
    general.save_path=./example_output/real_image_editing/ \
    general.real_image_editing=True \
    real_image_editing.dreambooth_path=./example_output/real_image_editing/dreambooth/dreambooth_150.ckp \
    real_image_editing.text_inversion_path=./example_output/real_image_editing/text_inversion/learned_embeds_iteration_500.bin

Here are some example outputs of real image editing.

teaser

Citation

If this repo is helpful for you, please consider citing it. Thank you! :)

@article{chen2023trainingfree,
      title={Training-Free Layout Control with Cross-Attention Guidance}, 
      author={Minghao Chen and Iro Laina and Andrea Vedaldi},
      journal={arXiv preprint arXiv:2304.03373},
      year={2023}
}

To Do List

  • Basic Backward Guidance
  • Support Different Layer of Backward Guidance
  • Forward Guidance
  • Real Image Editing Example

Acknowledgements

This research is supported by ERC-CoG UNION 101001212. The code is inspired by Diffusers and Stable Diffusion.