Show-o

One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie¹* Weijia Mao¹* Zechen Bai¹* David Junhao Zhang¹*
Weihao Wang² Kevin Qinghong Lin¹ Yuchao Gu¹ Zhijie Chen² Zhenheng Yang² Mike Zheng Shou¹

¹ Show Lab, National University of Singapore ² Bytedance

An overview of Show-o. The input data, regardless of its modalities, is tokenized and then prompted into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens in (discrete) denoising diffusion modeling via full attention, and then generates the desired output. Specifically, Show-o is capable of handling image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed modality generation.

Characteristics comparison among understanding only, generation only, and unified (understanding & generation) models. Vision and Language indicate the representations from specific input modalities. In this context, Diffusion represents both continuous and discrete diffusion.

News

[2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation, including image captioning, visual question answering (VQA), text-to-image generation, and text-guided inpainting and extrapolation.

TODO

Release the inference code.
Release the training code (in the coming weeks).
Scale up the model size (based on LLaMA3) and increase the amount of training data.

Getting Started

First, set up the environment:

pip3 install -r requirements.txt

Download the model weights of the pre-trained LLM (Phi-1.5):

git lfs install
git clone https://huggingface.co/microsoft/phi-1_5

Download the Show-o model weights and place them in a directory with the structure below:

checkpoints/
├── magvitv2.pth
├── showo.bin
├── showo_w_clip_vit.bin
└── phi-1_5
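Before running the demos, it can help to confirm that the directory matches the layout above. The following is a minimal sketch, not part of the Show-o code base: the helper name `missing_checkpoints` is hypothetical, and the list of expected entries is copied from the listing above.

```python
from pathlib import Path

# Entry names copied from the directory listing in this README.
EXPECTED = ["magvitv2.pth", "showo.bin", "showo_w_clip_vit.bin", "phi-1_5"]


def missing_checkpoints(root="checkpoints"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing from checkpoints/:", ", ".join(missing))
    else:
        print("All expected checkpoint entries are present.")
```

The check only looks at names; it does not verify that the files are complete or uncorrupted.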

Log in to your wandb account so that the demos below can upload their results:

wandb login <your wandb keys>

Inference demo for Multimodal Understanding. You can view the results on wandb.

python3 inference_mmu.py config=configs/showo_demo_w_clip_vit.yaml \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?' \
pretrained_model_path=./checkpoints/showo_w_clip_vit.bin
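In the command above, the `question` value appears to hold several questions separated by ` *** `. That reading is inferred from the example string, not from documented behavior. Under that assumption, the sketch below assembles the same command from a list of questions; the helper `mmu_command` is hypothetical, and its defaults are copied from the command above.

```python
import shlex


def mmu_command(questions,
                image_root="./mmu_validation",
                config="configs/showo_demo_w_clip_vit.yaml",
                weights="./checkpoints/showo_w_clip_vit.bin"):
    """Return an argv list for inference_mmu.py.

    Assumption: multiple questions are joined with ' *** ',
    as the example command in this README suggests.
    """
    return [
        "python3", "inference_mmu.py",
        f"config={config}",
        f"mmu_image_root={image_root}",
        "question=" + " *** ".join(questions),
        f"pretrained_model_path={weights}",
    ]


if __name__ == "__main__":
    cmd = mmu_command([
        "Please describe this image in detail.",
        "Do you think the image is unusual or not?",
    ])
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run(cmd)` avoids shell quoting issues with the spaces and asterisks in the question string.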

Inference demo for Text-to-Image Generation. You can view the results on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=1.75 generation_timesteps=18 \
mode='t2i' pretrained_model_path=./checkpoints/showo.bin

Inference demo for Text-guided Inpainting. You can view the results on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp

Inference demo for Text-guided Extrapolation. You can view the results on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=32 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
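The extrapolation command above lists six directions and repeats the same prompt six times, each joined with ` *** `. Writing that by hand is error-prone, so here is a minimal sketch that builds both values. It assumes, based only on the example, that one prompt is expected per direction; the helper `extrapolation_args` is hypothetical.

```python
SEP = " *** "  # separator as it appears in the example command


def extrapolation_args(prompt, directions):
    """Build the extra_direction=... and prompt=... overrides.

    Assumption: the script expects one prompt per direction, so the
    prompt is repeated len(directions) times.
    """
    if not directions:
        raise ValueError("need at least one direction")
    extra_direction = SEP.join(directions)
    prompts = SEP.join([prompt] * len(directions))
    return [f"extra_direction={extra_direction}", f"prompt={prompts}"]


if __name__ == "__main__":
    for arg in extrapolation_args(
        "a serene natural landscape featuring a clear, blue lake "
        "surrounded by lush green trees.",
        ["left"] * 3 + ["right"] * 3,
    ):
        print(arg)
```

The two returned strings can be appended to the argument list of `inference_t2i.py` alongside the other overrides shown in the command above.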

Citation

To cite the paper and model, please use the following BibTeX entry:

@article{xie2024showo,
  title={Show-o: One Single Transformer to Unify Multimodal Understanding and Generation},
  author={Xie, Jinheng and Mao, Weijia and Bai, Zechen and Zhang, David Junhao and Wang, Weihao and Lin, Kevin Qinghong and Gu, Yuchao and Chen, Zhijie and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2408.12528},
  year={2024}
}

Acknowledgments

This work is heavily based on open-muse, Phi-1.5, muse-maskgit-pytorch, maskgit, taming-transformers, transformers, accelerate, diffusers, and webdataset. Thanks to all the authors for their great work.