
Official Implementation of ICLR'24: Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Kosmos-G: Generating Images in Context with Multimodal Large Language Models



Download ViT-L-14-sd.pt and the checkpoints for stage 1, stage 2, and the final model:

mkdir kosmosg_checkpoints
cd kosmosg_checkpoints
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvVmlULUwtMTQtc2QucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O ViT-L-14-sd.pt "$DLINK"
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9zdGFnZTEucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O checkpoint_stage1.pt "$DLINK"
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9zdGFnZTIucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O checkpoint_stage2.pt "$DLINK"
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9maW5hbC5wdD9zdj0yMDIzLTAxLTAzJnN0PTIwMjQtMDQtMTBUMTMlM0ExMSUzQTQ0WiZzZT0yMDUwLTA0LTExVDEzJTNBMTElM0EwMFomc3I9YyZzcD1yJnNpZz00Y1hKSWpWUlpISUJXcUhqUGdEbiUyRjAxb2N6cERXWGlwbVBDVWszWjh2YlElM0Q=" | base64 --decode)
wget -O checkpoint_final.pt "$DLINK"
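
An interrupted download can leave a missing or empty file, which usually only surfaces later as a checkpoint loading error. The following is a minimal sketch of a sanity check, assuming the four file names used in the commands above. The helper name `check_checkpoints` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of the repository): report checkpoint files
# that are missing or empty in the given directory (default: current directory).
check_checkpoints() {
    local dir="${1:-.}" status=0 f
    for f in ViT-L-14-sd.pt checkpoint_stage1.pt checkpoint_stage2.pt checkpoint_final.pt; do
        # -s is true only when the file exists and has a size greater than zero.
        if [ -s "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_checkpoints .` inside kosmosg_checkpoints prints one line per file and returns a non-zero status if any file should be downloaded again. It does not verify file contents, only presence and non-zero size.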


Using Docker Image [Recommended]

You can use our prebuilt Docker image:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ xichenpan/kosmosg:v1 /bin/bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/

Alternatively, start from the official NVIDIA PyTorch Docker image and install all dependencies manually:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash
apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh

Using Base Environment

Make sure you have PyTorch 1.13.0 and nvcc 11.x installed.
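
One way to confirm both prerequisites before running the setup script is sketched below. It assumes `python` and `nvcc` are the executables on your PATH that the setup will use. The helper name `version_has_prefix` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of the repository): succeed when a version
# string starts with the expected prefix, e.g. "1.13.0+cu117" starts with "1.13.0".
version_has_prefix() {
    case "$1" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Report the installed PyTorch version, if torch can be imported.
if torch_version=$(python -c 'import torch; print(torch.__version__)' 2>/dev/null); then
    if version_has_prefix "$torch_version" "1.13.0"; then
        echo "PyTorch $torch_version: ok"
    else
        echo "PyTorch $torch_version: expected 1.13.0"
    fi
else
    echo "PyTorch is not importable from python"
fi

# Report the CUDA compiler release, if nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    # nvcc prints a line such as "Cuda compilation tools, release 11.8, V11.8.89".
    nvcc_release=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
    if version_has_prefix "$nvcc_release" "11."; then
        echo "nvcc $nvcc_release: ok"
    else
        echo "nvcc $nvcc_release: expected 11.x"
    fi
else
    echo "nvcc not found on PATH"
fi
```

The check only reports what it finds and never exits with an error, so it is safe to paste into an interactive shell.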

git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh


If you would like to host a local Gradio demo, run the following command after setup:

bash runapp.sh

If the default guidance scale produces over-saturated images, adjust it; lower values typically reduce saturation.


Preparing dataset

Refer to this guide to prepare the dataset.

Train script

After preparing the data, run the following command to train the model. Be sure to change the directories in the script to your own.

For the image decoder aligning stage:

bash runalign.sh

For the instruction tuning stage:

bash runtrain.sh


FID score on COCO (2014) val set

Download and unzip the COCO (2014) val set:

mkdir coco
cd coco
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
unzip annotations_trainval2014.zip
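
Before running the evaluation, it can help to confirm that the extraction is complete. The sketch below counts the extracted images from inside the coco directory. The helper name `count_jpgs` is hypothetical, and the expected total of 40,504 is the image count published for the COCO 2014 validation split.

```shell
# Hypothetical helper (not part of the repository): count .jpg files under a directory.
count_jpgs() {
    find "$1" -type f -name '*.jpg' | wc -l | tr -d ' '
}

# Compare the extracted image count with the published size of the split.
if [ -d val2014 ]; then
    echo "val2014 images: $(count_jpgs val2014) (expected 40504)"
fi
```

A lower count suggests that unzip was interrupted and should be run again.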

Specify the cfg in sample_kosmosg_coco.py and run the script to evaluate:

bash runeval_coco.sh

DINO score, CLIP-I score and CLIP-T score on DreamBench

Download DreamBench:

mkdir dreambench
cd dreambench
git clone https://github.com/google/dreambooth.git

We keep only one image for each entity as described in our paper.

bash scripts/remove_dreambench_multiimg.sh /path/to/dreambench/dreambooth/dataset
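
To confirm that the pruning step left exactly one image per entity, a check along the following lines may be useful. It is a sketch that assumes the dataset directory contains one subdirectory per entity with .jpg, .jpeg, or .png images directly inside. The helper name `list_unpruned_entities` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of the repository): print every entity
# directory that does not contain exactly one image file.
# Assumes one subdirectory per entity, with the images directly inside it.
list_unpruned_entities() {
    local root="$1" dir n
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when the root has no subdirectories.
        [ -d "$dir" ] || continue
        n=$(find "$dir" -maxdepth 1 -type f \
            \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \) | wc -l | tr -d ' ')
        if [ "$n" -ne 1 ]; then
            echo "$(basename "$dir"): $n images"
        fi
    done
}
```

Running `list_unpruned_entities /path/to/dreambench/dreambooth/dataset` prints nothing when every entity directory holds a single image.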

Specify the cfg in sample_kosmosg_dreambench.py and run the script to evaluate:

bash runeval_dreambench.sh


Kosmos-G is purely a research project. Currently, we have no plans to incorporate Kosmos-G into a product or expand access to the public. We will also put Microsoft AI principles into practice when further developing the models.

In our research paper, we account for the ethical concerns associated with text-to-image research. To mitigate issues associated with training data, we have implemented a rigorous filtering process to purge our training data of inappropriate content, such as explicit imagery and offensive language, to minimize the likelihood of generating inappropriate content.


This repository is built using torchscale, fairseq, and open_clip. We thank the authors of Nerfies, who kindly open-sourced the template of the project page.


