Magnet

Official Implementation of "Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function" [NeurIPS 2024]


Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function

Chenyi Zhuang, Ying Hu, Pan Gao

I2ML, Nanjing University of Aeronautics and Astronautics

Paper

We propose Magnet, a training-free approach that improves attribute binding by manipulating object embeddings, enhancing disentanglement within the textual space.

🌟 Key Features

  1. In-depth analysis of the CLIP text encoder, highlighting the context issue of padding embeddings;
  2. Improved text alignment by applying positive and negative binding vectors to object embeddings, at negligible cost (a conceptual sketch follows this list);
  3. Plug-and-play integration with various T2I models and controlling methods, e.g., ControlNet.
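For intuition, here is a minimal, self-contained sketch of the binding-vector idea on a dummy embedding tensor. This is not the repository's implementation (Magnet estimates binding vectors from the candidate banks in bank/*.pt); the function name, token positions, and alpha/beta values below are purely illustrative.

import torch

def apply_binding_vectors(embeds, obj_idx, pos_idx, neg_idx,
                          alpha=0.6, beta=0.6):
    """Pull one object embedding toward its own attribute (positive
    vector) and push it away from a competing attribute (negative
    vector). All indices are token positions in the prompt."""
    out = embeds.clone()
    obj = out[0, obj_idx]
    v_pos = out[0, pos_idx] - obj  # direction toward the correct attribute
    v_neg = out[0, neg_idx] - obj  # direction toward the competing attribute
    out[0, obj_idx] = obj + alpha * v_pos - beta * v_neg
    return out

# Dummy tensor shaped like a CLIP text output: (batch, 77 tokens, 768 dims).
# For "a blue car and a red bench", "car" sits at position 3 with its
# attribute "blue" at 2 and the competing attribute "red" at 6.
dummy = torch.randn(1, 77, 768)
modified = apply_binding_vectors(dummy, obj_idx=3, pos_idx=2, neg_idx=6)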

⚙️ Setup and Usage

conda create --name magnet python=3.11
conda activate magnet

# Install requirements
pip install -r requirements.txt

If you are curious about how different types of text embeddings influence generation, we recommend running (1) visualize_attribute_bias.ipynb to explore the attribute bias of different objects, and (2) emb_swap_cases.py to reproduce the padding-embedding swapping experiment.
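If you prefer a quick look before opening those files, the following is a minimal sketch of the padding-embedding swap, assuming the standard Hugging Face CLIP text encoder used by SD v1.x (the model id and prompts are illustrative, not taken from emb_swap_cases.py):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode(prompt):
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt").input_ids
    with torch.no_grad():
        return ids, text_encoder(ids)[0]  # last hidden state, (1, 77, 768)

ids_a, emb_a = encode("a blue car")     # both prompts tokenize to the
ids_b, emb_b = encode("a red bench")    # same length in this example

# Positions after <eos> hold padding embeddings; because CLIP uses causal
# attention, they still encode the whole prompt's context.
eos_a = (ids_a[0] == tokenizer.eos_token_id).nonzero()[0].item()
swapped = emb_a.clone()
swapped[0, eos_a + 1:] = emb_b[0, eos_a + 1:]  # graft B's padding onto A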

Download the pre-trained SD V1.4, SD V1.5 (its original download link unfortunately now returns a 404), SD V2, SD V2.1, or SDXL.

# Run magnet on SD V1.4
python run.py --sd_path path-to-stable-diffusion-v1-4 --magnet_path bank/candidates_1_4.pt --N 2 --run_sd

# Run magnet on SDXL
python run.py --sd_path path-to-stable-diffusion-xl --magnet_path bank/candidates_sdxl.pt --N 2 --run_sd

# Omit the --run_sd flag if you don't also want outputs from the standard (unmodified) model

You can also try ControlNet conditioned on depth maps estimated by DPT-Large.

# Run magnet with ControlNet
python run_with_controlnet.py --sd_path path-to-stable-diffusion-v1-5 --magnet_path bank/candidates_1_5.pt --N 2 --controlnet_path path-to-sd-controlnet-depth --dpt_path path-to-dpt-large --run_sd
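Under the hood, this setup roughly corresponds to the following diffusers sketch (the paths are the same placeholders as in the command above; the exact wiring in run_with_controlnet.py may differ):

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from transformers import pipeline as hf_pipeline

# Estimate a depth map with DPT-Large to condition ControlNet on
depth_estimator = hf_pipeline("depth-estimation", model="path-to-dpt-large")
depth_map = depth_estimator(Image.open("input.png"))["depth"]

controlnet = ControlNetModel.from_pretrained(
    "path-to-sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path-to-stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Swap the plain prompt for Magnet-modified prompt_embeds to combine both
image = pipe("a blue car and a red bench", image=depth_map).images[0]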

We also provide run_vanilla_pipeline.py to use Magnet via the prompt_embeds argument of the standard StableDiffusionPipeline; a sketch of this interface is shown below.
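A minimal sketch of that interface, assuming a local SD v1.5 checkpoint (the prompt and paths are placeholders; see run_vanilla_pipeline.py for the actual Magnet manipulation step):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path-to-stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

ids = pipe.tokenizer("a blue car and a red bench", padding="max_length",
                     max_length=pipe.tokenizer.model_max_length,
                     truncation=True, return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = pipe.text_encoder(ids.to(pipe.device))[0]

# ... apply Magnet's manipulation to prompt_embeds here ...

image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save("output.png")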

Demos of cross-attention visualization are in visualize_attention.ipynb.

Feel free to explore Magnet and open an issue if you have any questions!

😺 Examples

Comparison with state-of-the-art approaches:

Integrating Magnet into other T2I pipelines and controlling modules:

😿 Limitations

Magnet's performance largely depends on the underlying pre-trained T2I model, and text-based manipulation alone may not be powerful enough to produce a meaningful change for every prompt. If you are not satisfied with an output, try adjusting the prompt, seed, or hyperparameters, or combine Magnet with other techniques.

🌊 Acknowledgements

Most prompts are based on datasets obtained from Structure Diffusion. We also refer to Prompt-to-Prompt and PixArt.

TODO

  • Release the source code and model.
  • Extend to more T2I models.
  • Extend to controlling approaches.