
Primary language: Python · License: Apache-2.0

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Official implementation of Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion models by adapting pretrained ControlNets.

arXiv · Project Page · Checkpoints

Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal



Ctrl-Adapter is an efficient and versatile framework for adding diverse spatial controls to any image or video diffusion model. It supports a variety of useful applications, including video control, video control with multiple conditions, video control with sparse frame conditions, image control, zero-shot transfer to unseen conditions, and video editing.

🔥 News

  • May 26, 2024. Check out our new arXiv v2 for exciting new additions to Ctrl-Adapter!
    • Support for DiT-based backbones (Latte, PixArt-α)
    • Fine-grained patch-level MoE router for multi-control composition
    • Downstream tasks beyond spatial control (video editing, video style transfer, text-guided motion control)
  • Apr. 30, 2024. Training code released now! It's time to train Ctrl-Adapter on your desired backbone! 🚀🚀
  • Apr. 29, 2024. SDXL, I2VGen-XL, and SVD inference code and checkpoints are all released!

🔧 Setup

Environment Setup

If you only need to perform inference with our code, install from requirements_inference.txt. To keep our codebase easy to use, the primary libraries required are PyTorch, Diffusers, and Transformers. Specific versions of these libraries are not required; the default versions should work fine :)

If you plan to conduct training, install from requirements_train.txt instead, which contains the additional dependencies needed for training.

conda create -n ctrl-adapter python=3.10
conda activate ctrl-adapter
pip install -r requirements_inference.txt # install from this if you only need to perform inference
pip install -r requirements_train.txt # install from this if you plan to do some training
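After installation, a quick sanity check can confirm that the core dependencies are importable. This sketch assumes the standard top-level package names (torch, diffusers, transformers):

```python
import importlib.util

def check_deps(names):
    """Map each package name to whether it is importable in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Core libraries the codebase relies on (per the setup notes above)
status = check_deps(["torch", "diffusers", "transformers"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```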


🔮 Inference

We provide model checkpoints and inference scripts for Ctrl-Adapter trained on SDXL, I2VGen-XL, and SVD. All inference scripts are located under ./inference_scripts.

📌 Notice Before You Begin

Please note that there is usually no single model that excels at generating images/videos for all motion styles across various control conditions.

Different image/video generation backbones may perform better with specific types of motion. For instance, we have observed that SVD excels at sliding motions but generally performs worse than I2VGen-XL on complex motions (consistent with the findings in DynamiCrafter). Additionally, different control conditions can lead to significantly different results in the generated images/videos, and some control conditions are more informative than others for certain types of motion.

📌 Inference Data Structure

We put some sample images/frames for inference under the folder ./assets/evaluation. You can add your custom examples following the same file structure illustrated below.

For model inference, we support two options:

  • If you already have condition image/frames extracted from some image/video, you can use inference (w/ extracted condition).
./assets/evaluation/images
    ├── depth
    │   ├── anime_corgi.png
    ├── raw_input
    │   ├── anime_corgi.png
    ├── captions.json

./assets/evaluation/frames
    ├── depth
    │   ├── newspaper_cat
    │   │   ├── 00000.png
    │   │   ├── 00001.png
    │   │   ...
    │   │   ├── 00015.png
    ├── raw_input
    │   ├── newspaper_cat
    │   │   ├── 00000.png # only the 1st frame is needed for I2V models
    ├── captions.json
  • If you haven't extracted control conditions and only have the raw image/frames, you can use inference (w/o extracted condition). In this case, our code automatically extracts the control conditions from the input image/frames and then generates the corresponding image/video.
./assets/evaluation/images
    ├── raw_input
    │   ├── anime_corgi.png
    ├── captions.json

./assets/evaluation/frames
    ├── raw_input
    │   ├── newspaper_cat
    │   │   ├── 00000.png
    │   │   ├── 00001.png
    │   │   ...
    │   │   ├── 00015.png
    ├── captions.json
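The layout above can be reproduced for a custom example with a few lines of Python. The example name "newspaper_cat" and the 16-frame count come from the samples above; the caption text and temporary root path here are just an illustration:

```python
import json
import tempfile
from pathlib import Path

# Build the expected layout under a temporary directory
root = Path(tempfile.mkdtemp()) / "assets" / "evaluation" / "frames"
raw = root / "raw_input" / "newspaper_cat"
raw.mkdir(parents=True)

for i in range(16):  # frame files are zero-padded to 5 digits
    (raw / f"{i:05d}.png").touch()

# captions.json maps each example name to its text prompt (caption text is hypothetical)
captions = {"newspaper_cat": "a cat reading a newspaper"}
(root / "captions.json").write_text(json.dumps(captions, indent=2))

frame_names = sorted(p.name for p in raw.iterdir())
print(frame_names[0], frame_names[-1], len(frame_names))
```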

📌 Run Inference Scripts

Here is a sample command to run inference on SDXL with depth map as control (w/ extracted condition).

sh inference_scripts/sdxl/sdxl_inference_depth.sh

⚠️ --control_guidance_end: this is the most important parameter for balancing generated image/video quality against control strength. If the generated image/video does not follow the spatial control well, increase this value; if the quality suffers because the spatial control is too strong, decrease it. A detailed discussion of control strength via this parameter can be found in our paper.
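As a mental model, --control_guidance_end can be read as the fraction of the denoising schedule during which control features are applied. The sketch below assumes the convention used by diffusers-style ControlNet pipelines; it is an illustration, not the actual implementation:

```python
def control_active(step: int, num_steps: int, control_guidance_end: float) -> bool:
    """Sketch: apply ControlNet features only during the first
    `control_guidance_end` fraction of the denoising schedule."""
    return (step / num_steps) < control_guidance_end

# With 50 denoising steps and control_guidance_end=0.6, the spatial
# control is applied for the first 30 steps and dropped afterwards.
applied = [control_active(s, num_steps=50, control_guidance_end=0.6) for s in range(50)]
print(sum(applied))
```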

We list the inference scripts for different tasks mentioned in our paper as follows ⬇️

Controllable Image Generation



SDXL

| Control Conditions | Checkpoints | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
| --- | --- | --- | --- |
| Depth Map | HF link | command | command |
| Canny Edge | HF link | command | command |
| Soft Edge | HF link | command | command |
| Normal Map | HF link | command | command |
| Segmentation | HF link | command | command |
| Scribble | HF link | command | command |
| Lineart | HF link | command | command |

Controllable Video Generation



I2VGen-XL

| Control Conditions | Checkpoints | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
| --- | --- | --- | --- |
| Depth Map | HF link | command | command |
| Canny Edge | HF link | command | command |
| Soft Edge | HF link | command | command |

SVD

| Control Conditions | Checkpoints | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
| --- | --- | --- | --- |
| Depth Map | HF link | command | command |
| Canny Edge | HF link | command | command |
| Soft Edge | HF link | command | command |

Video Generation with Multi-Condition Control



We currently implement multi-condition control on I2VGen-XL. The following checkpoint is trained on 7 control conditions: depth, canny, normal, softedge, segmentation, lineart, and openpose. Below are sample inference scripts that use depth, canny, segmentation, and openpose as control conditions.

| Adapter Checkpoint | Router Checkpoint | Inference (w/ extracted condition) | Inference (w/o extracted condition) |
| --- | --- | --- | --- |
| HF link | HF link | command | command |
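As a rough illustration of how a router can combine multiple controls, here is a minimal sketch: softmax-normalized per-condition scores produce mixing weights over the active conditions. This is hypothetical; the patch-level MoE router described in the paper is learned, and its scores are produced per patch:

```python
import math

def route(scores):
    """Hypothetical sketch of a control router: softmax over per-condition
    scores yields mixing weights for combining adapter features."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Four active conditions, e.g. depth, canny, segmentation, openpose
weights = route([1.0, 0.5, 0.2, 0.1])
print([round(w, 3) for w in weights])
```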

Video Generation with Sparse Control



Here we provide a sample inference script that uses user scribbles as the condition, with 4 out of 16 frames given for sparse control.

| Control Conditions | Checkpoint | Inference (w/ extracted condition) |
| --- | --- | --- |
| Scribbles | HF link | command |
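To illustrate what "4 out of 16 frames" means, here is a hypothetical sketch of evenly spaced keyframe selection; the actual frame indices used for sparse control are configured in the inference script:

```python
def keyframe_indices(num_frames: int, num_keyframes: int):
    """Evenly spaced keyframe positions (a hypothetical selection strategy)."""
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]

# 4 of 16 frames carry the scribble condition; the rest are generated freely
print(keyframe_indices(16, 4))
```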

🚅 How To Train

🎉 To make our method reproducible and adaptable to new backbones, we have released all of our training code :)

You can find detailed training guidelines for Ctrl-Adapter here!

📝 TODO List

  • Release environment setup, inference code, and model checkpoints.
  • Release training code.
  • Training guideline to adapt our Ctrl-Adapter to new image/video diffusion models.
  • Ctrl-Adapter + DiT-based image/video generation backbones (Latte, PixArt-α). (WIP)
  • Code for video editing and text-guided motion control. (WIP)
  • Release evaluation code.

💗 Please let us know in the issues or PRs if you're interested in any relevant backbones or downstream tasks that could be implemented with our Ctrl-Adapter framework! We welcome collaborations and contributions!

📚 BibTeX

🌟 If you find our project useful in your research or application development, citing our paper would be the best support for us!

@misc{lin2024ctrladapter,
      title={Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model}, 
      author={Han Lin and Jaemin Cho and Abhay Zala and Mohit Bansal},
      year={2024},
      eprint={2404.09967},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🙏 Acknowledgements

The development of Ctrl-Adapter has been greatly inspired by the following amazing works and teams:

We hope that releasing this model/codebase helps the community to continue pushing these creative tools forward in an open and responsible way.