
BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities


This is the official PyTorch code for the paper:

BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities
Shaozhe Hao1, Xuantong Liu2, Xianbiao Qi3*, Shihao Zhao1, Bojia Zi4, Rong Xiao3, Kai Han1†, Kwan-Yee K. Wong1
1The University of Hong Kong   2Hong Kong University of Science and Technology
3Intellifusion   4The Chinese University of Hong Kong
(*: Project lead; †: Corresponding authors)

[Project page] [arXiv] [Colab]

TL;DR: We introduce BiGR, a novel conditional image generation model that uses compact binary latent codes for generative training, enhancing both generation and representation capabilities.

📢 News

🌟 We are training BiGR with REPA, a representation alignment regularization that enhances both generation and representation performance in DiT/SiT.

⚙️ Setup

You can set up the environment with the provided environment.yml file:

conda env create -f environment.yml
conda activate BiGR

🔗 Download

Please first download the pretrained weights for the tokenizers and BiGR models before running the scripts below.

Binary Autoencoder

We train the Binary Autoencoder (B-AE) by adapting the official code of Binary Latent Diffusion. We provide pretrained weights for different configurations.

256x256 resolution

| B-AE | Size | Checkpoint |
|------|------|------------|
| d24  | 332M | download   |
| d32  | 332M | download   |

512x512 resolution

| B-AE    | Size | Checkpoint |
|---------|------|------------|
| d32-512 | 315M | download   |

BiGR models ✨

We provide pretrained weights for BiGR models in various sizes.

256x256 resolution

| Model        | B-AE | Size  | Checkpoint |
|--------------|------|-------|------------|
| BiGR-L-d24   | d24  | 1.35G | download   |
| BiGR-XL-d24  | d24  | 3.20G | download   |
| BiGR-XXL-d24 | d24  | 5.92G | download   |
| BiGR-XXL-d32 | d32  | 5.92G | download   |

512x512 resolution

| Model             | B-AE       | Size  | Checkpoint |
|-------------------|------------|-------|------------|
| BiGR-L-d32-res512 | d32-res512 | 1.49G | download   |

🚀 Image generation

We provide the sample script for 256x256 image generation in script/sample.sh.

bash script/sample.sh

Please specify the code dimension $CODE, your B-AE checkpoint path $CKPT_BAE, and your BiGR checkpoint path $CKPT_BIGR.

You may also want to experiment with the CFG scale $CFG, the number of sampling iterations $ITER, and the Gumbel temperature $GUMBEL. We recommend a small Gumbel temperature (e.g., GUMBEL=0) for better visual quality; increasing it enhances generation diversity.
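
For example, a sampling run might look like the following sketch. The checkpoint paths are placeholders, and script/sample.sh may expect these variables to be edited at the top of the script rather than read from the environment.

export CODE=24                            # binary code dimension (pairs with the d24 B-AE)
export CKPT_BAE=weights/bae_d24.pth       # B-AE checkpoint (placeholder path)
export CKPT_BIGR=weights/bigr_L_d24.pth   # BiGR checkpoint (placeholder path)
export CFG=2.5                            # classifier-free guidance scale
export ITER=20                            # number of sampling iterations
export GUMBEL=0                           # Gumbel temperature (0 favors visual quality)
bash script/sample.sh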

You can generate 512x512 images using script/sample_512.sh. Note that you need to specify the corresponding 512x512 tokenizers and models.

bash script/sample_512.sh
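
For example, pairing the d32-res512 B-AE with a 512x512 BiGR model (placeholder paths, same caveat as above):

export CODE=32
export CKPT_BAE=weights/bae_d32_res512.pth       # 512x512 B-AE checkpoint (placeholder)
export CKPT_BIGR=weights/bigr_L_d32_res512.pth   # 512x512 BiGR checkpoint (placeholder)
bash script/sample_512.sh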

💡 Zero-shot applications

BiGR supports various zero-shot generalized applications, without the need for task-specific structural changes or parameter fine-tuning.

You can download the test images and run our scripts to get started. Feel free to play with your own images.

Inpainting & Outpainting

bash script/app_inpaint.sh
bash script/app_outpaint.sh

You need to save the source image and the mask in the same folder, with the image as a *.JPEG file and the mask as a *.png file. You can then specify the source image path $IMG.

You can create custom masks using this Gradio demo.
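
A minimal sketch of the expected file layout and invocation (names are illustrative):

# The source image (*.JPEG) and its mask (*.png) share one folder:
#   data/inpaint/dog.JPEG   <- source image
#   data/inpaint/dog.png    <- mask
export IMG=data/inpaint/dog.JPEG
bash script/app_inpaint.sh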

Class-conditional editing

bash script/app_edit.sh

In addition to the source image path $IMG, you also need to give a class index $CLS for editing.
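
For example (the image path is a placeholder; class 207 is "golden retriever" in ImageNet-1K):

export IMG=data/edit/cat.JPEG   # source image (placeholder path)
export CLS=207                  # target ImageNet-1K class index
bash script/app_edit.sh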

Class interpolation

bash script/app_interpolate.sh

You need to specify two class indices $CLS1 and $CLS2.
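
For example (in ImageNet-1K, class 88 is "macaw" and class 207 is "golden retriever"):

export CLS1=88
export CLS2=207
bash script/app_interpolate.sh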

Image enrichment

bash script/app_enrich.sh

You need to specify the source image path $IMG.
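
For example (placeholder path):

export IMG=data/enrich/landscape.JPEG
bash script/app_enrich.sh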

💻 Train

You can train BiGR yourself by running:

bash script/train.sh

You need to specify the ImageNet-1K dataset path via --data-path.
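
A hypothetical launch, assuming train.sh forwards its arguments to the training entry point (the dataset path is a placeholder):

bash script/train.sh --data-path /path/to/imagenet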

We train L/XL-sized models using 8 A800 GPUs and XXL-sized models using 32 A800 GPUs on 4 nodes.

💐 Acknowledgement

This project builds on Diffusion Transformer, Binary Latent Diffusion, and LlamaGen. We thank these great works!

📖 Citation

If you use this code in your research, please consider citing our paper:

@misc{hao2024bigr,
    title={Bi{GR}: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities}, 
    author={Shaozhe Hao and Xuantong Liu and Xianbiao Qi and Shihao Zhao and Bojia Zi and Rong Xiao and Kai Han and Kwan-Yee~K. Wong},
    year={2024},
}