/ControlAR

Official code for "ControlAR: Controllable Image Generation with Autoregressive Models"

Primary LanguagePythonApache License 2.0Apache-2.0

ControlAR

Controllable Image Generation with Autoregressive Models

Zongming Li1,*, Tianheng Cheng1,*, Shoufa Chen2, Peize Sun2, Haocheng Shen3,Longjin Ran3, Xiaoxin Chen3, Wenyu Liu1, Xinggang Wang1,📧

1 Huazhong University of Science and Technology, 2 The University of Hong Kong 3 vivo AI Lab

(* equal contribution, 📧 corresponding author)

arxiv paper demo checkpoints

News

[2024-12-12]: We introduce control strength factor, employ a larger control encoder(dinov2-base), and optimize text alignment capabilities along with generation diversity. New model weight: depth_base.safetensors and edge_base.safetensors. The edge_base.safetensors can handle three types of edges, including Canny, HED, and Lineart.
[2024-10-31]: The code and models have been released!
[2024-10-04]: We have released the technical report of ControlAR. Code, models, and demos are coming soon!

Highlights

  • ControlAR explores an effective yet simple conditional decoding strategy for adding spatial controls to autoregressive models, e.g., LlamaGen, from a sequence perspective.

  • ControlAR supports arbitrary-resolution image generation with autoregressive models without hand-crafted special tokens or resolution-aware prompts.

TODO

Results

We provide both quantitative and qualitative comparisons with diffusion-based methods in the technical report!

Models

We released checkpoints of text-to-image ControlAR on different controls and settings, i.e. arbitrary-resolution generation.

AR Model Type Control encoder Control Arbitrary-Resolution Checkpoint
LlamaGen-XL t2i DINOv2-small Canny Edge ckpt
LlamaGen-XL t2i DINOv2-small Depth ckpt
LlamaGen-XL t2i DINOv2-small HED Edge ckpt
LlamaGen-XL t2i DINOv2-small Seg. Mask ckpt
LlamaGen-XL t2i DINOv2-base Edge (Canny, Hed, Lineart) ckpt
LlamaGen-XL t2i DINOv2-base Depth ckpt

Getting Started

Installation

conda create -n ControlAR python=3.10
git clone https://github.com/hustvl/ControlAR.git
cd ControlAR
pip install torch==2.1.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip3 install -U openmim 
mim install mmengine 
mim install "mmcv==2.1.0"
pip3 install "mmsegmentation>=1.0.0"
pip3 install mmdet
git clone https://github.com/open-mmlab/mmsegmentation.git

Pretrained Checkpoints for ControlAR

tokenizer text encoder LlamaGen-B LlamaGen-L LlamaGen-XL
vq_ds16_t2i.pt flan-t5-xl c2i_B_256.pt c2i_L_256.pt t2i_XL_512.pt

We recommend storing them in the following structures:

|---checkpoints
      |---t2i
            |---canny/canny_MR.safetensors
            |---hed/hed.safetensors
            |---depth/depth_MR.safetensors
            |---seg/seg_cocostuff.safetensors
            |---edge_base.safetensors
            |---depth_base.safetensors
      |---t5-ckpt
            |---flan-t5-xl
                  |---config.json
                  |---pytorch_model-00001-of-00002.bin
                  |---pytorch_model-00002-of-00002.bin
                  |---pytorch_model.bin.index.json
                  |---tokenizer.json
      |---vq
            |---vq_ds16_c2i.pt
            |---vq_ds16_t2i.pt
      |---llamagen (Only necessary for training)
            |---c2i_B_256.pt
            |---c2i_L_256.pt
            |---t2i_XL_stage2_512.pt

Demo

Coming soon...

Sample & Generation

1. Class-to-image genetation

python autoregressive/sample/sample_c2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--gpt-ckpt checkpoints/c2i/canny/LlamaGen-L.pt \
--gpt-model GPT-L --seed 0 --condition-type canny

2. Text-to-image generation

Generate an image using HED edge and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/hed/hed.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type hed --seed 0 --condition-path condition/example/t2i/multigen/eye.png

Generate an image using segmentation mask and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_cocostuff.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type seg --seed 0 --condition-path condition/example/t2i/cocostuff/doll.png \
--prompt 'A stuffed animal wearing a mask and a leash, sitting on a pink blanket'

3. Text-to-image generation with adjustable control strength

Generate an image using depth map and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type seg --seed 0 --condition-path condition/example/t2i/multigen/bird.jpg \
--prompt 'A bird made of blue crystal' \
--adapter-size base \
--control-strength 0.6

Generate an image using lineart edge and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/edge_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type lineart --seed 0 --condition-path condition/example/t2i/multigen/girl.jpg \
--prompt 'A girl with blue hair' \
--adapter-size base \
--control-strength 0.6

4. Arbitrary-resolution generation

python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type depth --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0
python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/canny_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type canny --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0

Preparing Datasets

We provide the dataset datails for evaluation and training. If you don't want to train ControlAR, just download the validation splits.

1. Class-to-image

  • Download ImageNet and save it to data/imagenet/data.

2. Text-to-image

  • Download ADE20K with caption(~7GB) and save the .parquet files to data/Captioned_ADE20K/data.
  • Download COCOStuff with caption( ~62GB) and save the .parquet files to data/Captioned_COCOStuff/data.
  • Download MultiGen-20M( ~1.22TB) and save the .parquet files to data/MultiGen20M/data.

3. Preprocessing datasets

To save training time, we adopt the tokenizer to pre-process the images with the text prompts.

  • ImageNet
bash scripts/autoregressive/extract_file_imagenet.sh \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--data-path data/imagenet/data/val \
--code-path data/imagenet/val/imagenet_code_c2i_flip_ten_crop \
--ten-crop --crop-range 1.1 --image-size 256
  • ADE20k
bash scripts/autoregressive/extract_file_ade.sh \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--data-path data/Captioned_ADE20K/data --code-path data/Captioned_ADE20K/val \
--ten-crop --crop-range 1.1 --image-size 512 --split validation
  • COCOStuff
bash scripts/autoregressive/extract_file_cocostuff.sh \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--data-path data/Captioned_COCOStuff/data --code-path data/Captioned_COCOStuff/val \
--ten-crop --crop-range 1.1 --image-size 512 --split validation
  • MultiGen
bash scripts/autoregressive/extract_file_multigen.sh \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--data-path data/MultiGen20M/data --code-path data/MultiGen20M/val \
--ten-crop --crop-range 1.1 --image-size 512 --split validation

Testing and Evaluation

1. Class-to-image generation on ImageNet

bash scripts/autoregressive/test_c2i.sh \
--vq-ckpt ./checkpoints/vq/vq_ds16_c2i.pt \
--gpt-ckpt ./checkpoints/c2i/canny/LlamaGen-L.pt \
--code-path /path/imagenet/val/imagenet_code_c2i_flip_ten_crop \
--gpt-model GPT-L --condition-type canny --get-condition-img True \
--sample-dir ./sample --save-image True
python create_npz.py --generated-images ./sample/imagenet/canny

Then download imagenet validation data which contains 10000 images, or you can use the whole validation data as reference data by running val.sh.

Calculate the FID score:

python evaluations/c2i/evaluator.py /path/imagenet/val/FID/VIRTUAL_imagenet256_labeled.npz \
sample/imagenet/canny.npz

2. Text-to-image generation on ADE20k

Download Mask2Former(weight) and save it to evaluations/.

Use this command to get 2000 images based on the segmentation mask:

bash scripts/autoregressive/test_t2i.sh --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_ade20k.pt \
--code-path data/Captioned_ADE20K/val --gpt-model GPT-XL --image-size 512 \
--sample-dir sample/ade20k --condition-type seg --seed 0

Calculate mIoU of the segmentation masks from the generated images:

python evaluations/ade20k_mIoU.py

3. Text-to-image generation on COCOStuff

Download DeepLabV3(weight) and save it to evaluations/.

Generate images using segmentation masks as condition controls:

bash scripts/autoregressive/test_t2i.sh --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_cocostuff.pt \
--code-path data/Captioned_COCOStuff/val --gpt-model GPT-XL --image-size 512 \
--sample-dir sample/cocostuff --condition-type seg --seed 0

Calculate mIoU of the segmentation masks from the generated images:

python evaluations/cocostuff_mIoU.py

4. Text-to-image generation on MultiGen-20M

We adopt generation with HED edges as the example:

Generate 5000 images based on the HED edges generated from validation images

bash scripts/autoregressive/test_t2i.sh --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/hed/hed.safetensors --code-path data/MultiGen20M/val \
--gpt-model GPT-XL --image-size 512 --sample-dir sample/multigen/hed \
--condition-type hed --seed 0

Evaluate the conditional consistency (SSIM):

python evaluations/hed_ssim.py

Calculate the FID score:

python evaluations/clean_fid.py --val-images data/MultiGen20M/val/image --generated-images sample/multigen/hed/visualization

Training ControlAR

1. Class-to-image (Canny)

bash scripts/autoregressive/train_c2i_canny.sh --cloud-save-path output \
--code-path data/imagenet/train/imagenet_code_c2i_flip_ten_crop \
--image-size 256 --gpt-model GPT-B --gpt-ckpt checkpoints/llamagen/c2i_B_256.pt

2. Text-to-image (Canny)

bash scripts/autoregressive/train_t2i_canny.sh 

Acknowledgments

The development of ControlAR is based on LlamaGen, ControlNet, ControlNet++, and AiM, and we sincerely thank the contributors for thoese great works!

Citation

If you find ControlAR is useful in your research or applications, please consider giving us a star 🌟 and citing it by the following BibTeX entry.

@article{li2024controlar,
      title={ControlAR: Controllable Image Generation with Autoregressive Models}, 
      author={Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang},
      year={2024},
      eprint={2410.02705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.02705}, 
}