ControlAR

Controllable Image Generation with Autoregressive Models

Zongming Li¹,*, Tianheng Cheng¹,*, Shoufa Chen², Peize Sun², Haocheng Shen³, Longjin Ran³, Xiaoxin Chen³, Wenyu Liu¹, Xinggang Wang¹,📧

¹ Huazhong University of Science and Technology, ² The University of Hong Kong, ³ vivo AI Lab

(* equal contribution, 📧 corresponding author)

News

[2024-10-31]: The code and models have been released!
[2024-10-04]: We have released the technical report of ControlAR. Code, models, and demos are coming soon!

Highlights

ControlAR explores an effective yet simple conditional decoding strategy for adding spatial controls to autoregressive models, e.g., LlamaGen, from a sequence perspective.
ControlAR supports arbitrary-resolution image generation with autoregressive models without hand-crafted special tokens or resolution-aware prompts.

TODO

[x] Release code & models.
[ ] Release demo code and HuggingFace demo.

Results

We provide both quantitative and qualitative comparisons with diffusion-based methods in the technical report!

Models

We have released text-to-image ControlAR checkpoints for different controls and settings, including arbitrary-resolution generation.

AR Model	Type	Control	Arbitrary-Resolution	Checkpoint
LlamaGen-XL	t2i	Canny Edge	✅	ckpt
LlamaGen-XL	t2i	Depth	✅	ckpt
LlamaGen-XL	t2i	HED Edge	❌	ckpt
LlamaGen-XL	t2i	Seg. Mask	❌	ckpt

Getting Started

Installation

conda create -n ControlAR python=3.10
# activate the new environment before installing packages into it
conda activate ControlAR
git clone https://github.com/hustvl/ControlAR.git
cd ControlAR
pip install torch==2.1.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip3 install -U openmim
mim install mmengine
mim install "mmcv==2.1.0"
pip3 install "mmsegmentation>=1.0.0"
pip3 install mmdet
git clone https://github.com/open-mmlab/mmsegmentation.git
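
After installation, a quick sanity check can confirm which of the packages above were actually resolved. The sketch below is not part of the repository; it relies only on Python's standard `importlib.metadata`, and the helper name `report_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the installation commands above.
PACKAGES = ["torch", "openmim", "mmengine", "mmcv", "mmsegmentation", "mmdet"]


def report_versions(packages=PACKAGES):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```

If the commands above succeeded, `torch` should be reported as a 2.1.2 CUDA 11.8 build and `mmcv` as 2.1.0.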

Pretrained Checkpoints for ControlAR

tokenizer	text encoder	LlamaGen-B	LlamaGen-L	LlamaGen-XL
vq_ds16_t2i.pt	flan-t5-xl	c2i_B_256.pt	c2i_L_256.pt	t2i_XL_512.pt

We recommend storing them in the following structure:

|---checkpoints
      |---t2i
            |---canny/canny_MR.safetensors
            |---hed/hed.safetensors
            |---depth/depth_MR.safetensors
            |---seg/seg_cocostuff.safetensors
      |---t5-ckpt
            |---flan-t5-xl
                  |---config.json
                  |---pytorch_model-00001-of-00002.bin
                  |---pytorch_model-00002-of-00002.bin
                  |---pytorch_model.bin.index.json
                  |---tokenizer.json
      |---vq
            |---vq_ds16_c2i.pt
            |---vq_ds16_t2i.pt
      |---llamagen (Only necessary for training)
            |---c2i_B_256.pt
            |---c2i_L_256.pt
            |---t2i_XL_stage2_512.pt
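
Before sampling, it can help to check that the downloaded files ended up where the commands expect them. The following is a minimal sketch, not a script shipped with ControlAR. It assumes the nesting implied by the sampling commands (for example `checkpoints/t2i/hed/hed.safetensors` and `checkpoints/vq/vq_ds16_t2i.pt`) and that `flan-t5-xl` sits inside `t5-ckpt`; the helper name `missing_checkpoints` is hypothetical.

```python
from pathlib import Path

# Inference-time files from the recommended layout
# (the llamagen/ folder is only needed for training, so it is omitted).
EXPECTED = [
    "t2i/canny/canny_MR.safetensors",
    "t2i/hed/hed.safetensors",
    "t2i/depth/depth_MR.safetensors",
    "t2i/seg/seg_cocostuff.safetensors",
    "t5-ckpt/flan-t5-xl/config.json",
    "vq/vq_ds16_c2i.pt",
    "vq/vq_ds16_t2i.pt",
]


def missing_checkpoints(root="checkpoints", expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  checkpoints/" + rel)
    else:
        print("All expected checkpoint files were found.")
```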

Demo

Coming soon...

Sample & Generation

1. Class-to-image generation

python autoregressive/sample/sample_c2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--gpt-ckpt checkpoints/c2i/canny/LlamaGen-L.pt \
--gpt-model GPT-L --seed 0 --condition-type canny

2. Text-to-image generation

Generate an image using HED edge and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/hed/hed.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type hed --seed 0 --condition-path condition/example/t2i/multigen/eye.png

Generate an image using segmentation mask and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_cocostuff.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type seg --seed 0 --condition-path condition/example/t2i/cocostuff/doll.png \
--prompt 'A stuffed animal wearing a mask and a leash, sitting on a pink blanket'
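
To run the same text-to-image command over several condition images, the invocation can be assembled programmatically. This is a sketch under stated assumptions: it uses only the flags shown in the commands above, `build_t2i_command` is a hypothetical helper rather than a ControlAR API, and the script is assumed to be run from the repository root.

```python
import shlex
import sys


def build_t2i_command(condition_type, gpt_ckpt, condition_path, prompt=None,
                      image_size=512, seed=0,
                      vq_ckpt="checkpoints/vq/vq_ds16_t2i.pt"):
    """Assemble the sample_t2i.py invocation as an argument list."""
    cmd = [
        sys.executable, "autoregressive/sample/sample_t2i.py",
        "--vq-ckpt", vq_ckpt,
        "--gpt-ckpt", gpt_ckpt,
        "--gpt-model", "GPT-XL",
        "--image-size", str(image_size),
        "--condition-type", condition_type,
        "--seed", str(seed),
        "--condition-path", condition_path,
    ]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    return cmd


if __name__ == "__main__":
    cmd = build_t2i_command(
        "seg",
        "checkpoints/t2i/seg/seg_cocostuff.safetensors",
        "condition/example/t2i/cocostuff/doll.png",
        prompt="A stuffed animal wearing a mask and a leash, "
               "sitting on a pink blanket",
    )
    # Print the shell-quoted command so it can be inspected before running.
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell-quoting problems with prompts that contain spaces or quotes.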

3. Arbitrary-resolution generation

python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth/depth_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type depth --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0

python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/canny/canny_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type canny --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0
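
Because the model generates an image as a token sequence, a larger resolution means a longer sequence. The sketch below estimates that length. It assumes that `ds16` in `vq_ds16_t2i.pt` denotes 16x spatial downsampling by the tokenizer; how `sample_t2i_MR.py` maps `--image-size` to a height and width is not stated here, so the helper (a hypothetical `token_grid`) takes both sides explicitly.

```python
DOWNSAMPLE = 16  # assumption: "ds16" in vq_ds16_t2i.pt means 16x downsampling


def token_grid(height, width, downsample=DOWNSAMPLE):
    """Return (rows, cols, total) image tokens for a height x width image."""
    if height % downsample or width % downsample:
        raise ValueError(f"sides must be multiples of {downsample}")
    rows, cols = height // downsample, width // downsample
    return rows, cols, rows * cols


if __name__ == "__main__":
    for h, w in [(512, 512), (768, 768), (768, 512)]:
        rows, cols, total = token_grid(h, w)
        print(f"{h}x{w}: {rows}x{cols} grid, {total} tokens")
```

Under this assumption a 512x512 image corresponds to 1024 tokens and a 768x768 image to 2304 tokens, so sampling time and memory grow accordingly at higher resolutions.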

Preparing Datasets

We provide the dataset details for evaluation and training. If you don't want to train ControlAR, just download the validation splits.

1. Class-to-image

Download ImageNet and save it to data/imagenet/data.

2. Text-to-image

Download ADE20K with captions (~7 GB) and save the .parquet files to data/Captioned_ADE20K/data.
Download COCOStuff with captions (~62 GB) and save the .parquet files to data/Captioned_COCOStuff/data.
Download MultiGen-20M (~1.22 TB) and save the .parquet files to data/MultiGen20M/data.

3. Preprocessing datasets

To save training time, we use the tokenizer to pre-encode the images, together with their text prompts, before training.

ImageNet

bash scripts/autoregressive/extract_file_imagenet.sh \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--data-path data/imagenet/data/val \
--code-path data/imagenet/val/imagenet_code_c2i_flip_ten_crop \
--ten-crop --crop-range 1.1 --image-size 256

ADE20K

bash scripts/autoregressive/extract_file_ade.sh \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--data-path data/Captioned_ADE20K/data --code-path data/Captioned_ADE20K/val \
--ten-crop --crop-range 1.1 --image-size 512 --split validation

COCOStuff

bash scripts/autoregressive/extract_file_cocostuff.sh \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--data-path data/Captioned_COCOStuff/data --code-path data/Captioned_COCOStuff/val \
--ten-crop --crop-range 1.1 --image-size 512 --split validation

MultiGen

bash scripts/autoregressive/extract_file_multigen.sh \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--data-path data/MultiGen20M/data --code-path data/MultiGen20M/val \
--ten-crop --crop-range 1.1 --image-size 512 --split validation

Testing and Evaluation

1. Class-to-image generation on ImageNet

bash scripts/autoregressive/test_c2i.sh \
--vq-ckpt ./checkpoints/vq/vq_ds16_c2i.pt \
--gpt-ckpt ./checkpoints/c2i/canny/LlamaGen-L.pt \
--code-path /path/imagenet/val/imagenet_code_c2i_flip_ten_crop \
--gpt-model GPT-L --condition-type canny --get-condition-img True \
--sample-dir ./sample --save-image True

python create_npz.py --generated-images ./sample/imagenet/canny

Then download the ImageNet validation data, which contains 10,000 images, or use the whole validation set as reference data by running val.sh.

Calculate the FID score:

python evaluations/c2i/evaluator.py /path/imagenet/val/FID/VIRTUAL_imagenet256_labeled.npz \
sample/imagenet/canny.npz
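
For reference, the standard definition of FID compares Gaussian statistics of Inception features from the reference images (mean $\mu_r$, covariance $\Sigma_r$) and the generated images (mean $\mu_g$, covariance $\Sigma_g$). This is the textbook formula, given here as background; implementation details of the evaluator script may differ.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the reference distribution.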

2. Text-to-image generation on ADE20K

Download Mask2Former (weight) and save it to evaluations/.

Use this command to generate 2,000 images conditioned on the segmentation masks:

bash scripts/autoregressive/test_t2i.sh --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_ade20k.pt \
--code-path data/Captioned_ADE20K/val --gpt-model GPT-XL --image-size 512 \
--sample-dir sample/ade20k --condition-type seg --seed 0

Calculate mIoU of the segmentation masks from the generated images:

python evaluations/ade20k_mIoU.py
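
As an illustration of the metric itself, mIoU averages the per-class intersection-over-union between predicted and ground-truth labels. The sketch below is not the repository's `ade20k_mIoU.py`; it is a dependency-free illustration on flat label lists. The helper name `mean_iou` and the `ignore_index=255` convention for unlabeled pixels are assumptions.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in pred or target.

    pred and target are equal-length flat sequences of integer class ids;
    predictions are assumed to lie in [0, num_classes).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled pixels
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # a mismatched pixel belongs to the union of both classes
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    # class 0: IoU = 1/2, class 1: IoU = 2/3
    print(mean_iou(pred, target, num_classes=2))
```

In the evaluation above, the predicted labels would come from a segmentation model applied to the generated images and the targets from the conditioning masks.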

3. Text-to-image generation on COCOStuff

Download DeepLabV3 (weight) and save it to evaluations/.

Generate images using segmentation masks as condition controls:

bash scripts/autoregressive/test_t2i.sh --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_cocostuff.safetensors \
--code-path data/Captioned_COCOStuff/val --gpt-model GPT-XL --image-size 512 \
--sample-dir sample/cocostuff --condition-type seg --seed 0