
DEEM: Diffusion models serve as the eyes of large language models for image perception


Run Luo1,2*, Yunshui Li1,2*, Longze Chen1,2*, Wanwei He1,2, Ting-En Lin5, Ziqiang Liu1,2, Lei Zhang1,2,
Zikai Song6, Xiaobo Xia4, Tongliang Liu4, Min Yang1,2🌟, Binyuan Hui3🌟

* Equal contribution 🌟 Corresponding author

1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Alibaba Group 4 The University of Sydney 5 Tsinghua University 6 HUST


[πŸ“– arXiv Paper] [πŸ“Š Dataset] [πŸ† Models]
DEEM explores using diffusion models as the "eyes" of multi-modal large language models, aiming to eliminate, from a vision-centric perspective, the potential biases introduced by different visual encoders. We hope DEEM prompts the multimodal community to consider whether an unbiased diffusion model can replace the traditional visual encoder and become a second unified multimodal architecture alongside autoregression.

πŸ”₯ Update

  • [07/21]πŸ”₯DEEM is coming! We release the code, models, and data for DEEM!
  • [07/05]πŸ”₯DEEM is coming! We release the paper for DEEM!


πŸ“· Setup

Please follow the instructions below to install the required packages.

  1. Clone this repository
git clone https://github.com/RainBowLuoCS/DEEM.git
  2. Install the required packages
conda create -n deem python=3.10 -y
conda activate deem
cd DEEM
pip install -r requirements.txt
# install `MultiScaleDeformableAttention` module
cd uni_interleaved/models/utils/ops
python setup.py install
  3. Download all pretrained model components from Hugging Face into the assets/ directory by running the following command:
python scripts/download_models.py
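
After finishing the three steps above, it can be worth running a quick sanity check before moving on. The sketch below is optional and assumption-level: it checks that a CUDA device is visible, that the compiled MultiScaleDeformableAttention extension (built in step 2) imports, and that the assets/ directory from step 3 exists.

```python
# sanity_check.py -- optional post-install check (a minimal sketch; the module
# name follows the comment in step 2, and assets/ is the download target of step 3).
import os

import torch

print("CUDA available:", torch.cuda.is_available())

try:
    # Compiled by `python setup.py install` in uni_interleaved/models/utils/ops.
    import MultiScaleDeformableAttention  # noqa: F401
    print("MultiScaleDeformableAttention extension: OK")
except ImportError as err:
    print("MultiScaleDeformableAttention extension missing:", err)

print("assets/ downloaded:", os.path.isdir("assets"))
```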

πŸ” Model

Here are the weights pretrained on Stage 1 data only:

| Model | Diffusion Model | Base LLM | Vision Encoder | Pretrain Data | Download |
|---|---|---|---|---|---|
| MM-interleaved-7B (Baseline) | SD 2.1 πŸ”₯ | Vicuna-7B-v1.5 | ConvNext-B | MMC4+LAION | ckpt |
| DEEM-7B | SD 2.1 | Vicuna-7B-v1.5 | ConvNext-B πŸ”₯ | MMC4+LAION | ckpt |
| (DEEM+MM-interleaved)-7B | SD 2.1 πŸ”₯ | Vicuna-7B-v1.5 | ConvNext-B πŸ”₯ | MMC4+LAION | ckpt |

We provide all our fully finetuned models on Stage 2 and 3 data for DEEM:

| Model | Base LLM | Vision Encoder | Finetuning Data | Download |
|---|---|---|---|---|
| DEEM-VQA 7B | Vicuna-7B-v1.5 | ConvNext-B | LLaVA-665k+VQA+COCO | ckpt |
| DEEM-MASK 7B | Vicuna-7B-v1.5 | ConvNext-B | ReferCOCO+VG+PartData | ckpt |

πŸ’‘ Preparation

Dataset

Please follow MM-Interleaved, LLaVA, and Osprey to prepare the corresponding images and data.

DEEM data structure

datasets
β”œβ”€β”€ laion
β”‚   β”œβ”€β”€ laion_annts
β”‚   β”‚   └── laion_shard_{0...1774}_v1.json
β”‚   └── laion_images
β”‚       └── {00000..01174}.tar
β”œβ”€β”€ mmc4
β”‚   β”œβ”€β”€ mmc4_annts
β”‚   β”‚   └── docs_no_face_shard_{0..23098}_v3.jsonl
β”‚   └── mmc4_images
β”‚       └── b9040a0dbb22.jpg
β”œβ”€β”€ aokvqa
β”‚   └── aokvqa_v1p0_train.json
β”œβ”€β”€ image2parag
β”‚   β”œβ”€β”€ paragraphs_coco.json
β”‚   β”œβ”€β”€ paragraphs_v1.json
β”‚   β”œβ”€β”€ test_split.json
β”‚   β”œβ”€β”€ train_split.json
β”‚   └── val_split.json
β”œβ”€β”€ coco
β”‚   β”œβ”€β”€ train2014 
β”‚   β”œβ”€β”€ train2017
β”‚   β”œβ”€β”€ val2014
β”‚   β”œβ”€β”€ val2017
β”‚   └── annotations
β”‚       β”œβ”€β”€ coco_karpathy_train.json
β”‚       β”œβ”€β”€ coco_karpathy_val.json
β”‚       β”œβ”€β”€ captions_train2017.json
β”‚       β”œβ”€β”€ coco_karpathy_val_gt.json
β”‚       β”œβ”€β”€ coco_karpathy_test.json
β”‚       β”œβ”€β”€ instances_train2017
β”‚       └── coco_karpathy_test_gt.json
β”œβ”€β”€ lncoco
β”‚   β”œβ”€β”€ coco_train_captions.jsonl
β”‚   └── coco_val_captions.jsonl
β”œβ”€β”€ flickr30k
β”‚   β”œβ”€β”€ flickr30k-images  
β”‚   β”œβ”€β”€ captiontobbox.json 
β”‚   β”œβ”€β”€ flickr30k_test1k.json
β”‚   β”œβ”€β”€ phrasetobbox.json
β”‚   └── groundedcaption.json
β”œβ”€β”€ gqa
β”‚   β”œβ”€β”€ images  
β”‚   β”œβ”€β”€ test_balanced_questions.json
β”‚   β”œβ”€β”€ train_balanced_questions.json
β”‚   └── testdev_balanced_questions.json 
β”œβ”€β”€ robustvqa
β”‚   β”œβ”€β”€ imagenet-r 
β”‚   β”œβ”€β”€ imagenet-a
β”‚   β”œβ”€β”€ imagenetv2
β”‚   └── robustvqa_test.json
β”œβ”€β”€ llava
β”‚   └── llava_v1_5_mix665k.json
β”œβ”€β”€ nocaps
β”‚   β”œβ”€β”€ val_imgs
β”‚   └── nocaps_val_4500_captions.json
β”œβ”€β”€ ocr_vqa
β”‚   β”œβ”€β”€ images 
β”‚   └── dataset.json
β”œβ”€β”€ okvqa
β”‚   β”œβ”€β”€ OpenEnded_mscoco_train2014_questions.json
β”‚   β”œβ”€β”€ OpenEnded_mscoco_val2014_questions.json 
β”‚   β”œβ”€β”€ mscoco_train2014_annotations.json
β”‚   └── mscoco_val2014_annotations.json
β”œβ”€β”€ part_data
β”‚   β”œβ”€β”€ test.json
β”‚   β”œβ”€β”€ train.json
β”‚   β”œβ”€β”€ val.json
β”‚   β”œβ”€β”€ partImagenet 
β”‚   β”‚   β”œβ”€β”€ train         
β”‚   β”‚   └── partImagenet_train_format.json
β”‚   └── pascal_part
β”‚       β”œβ”€β”€ VOCdevkit
β”‚       └── pascalpart_train.json
β”œβ”€β”€ pope
β”‚   β”œβ”€β”€ coco_pope_adversarial.json
β”‚   β”œβ”€β”€ coco_pope_popular.json
β”‚   └── coco_pope_random.json
β”œβ”€β”€ refcoco
β”‚   β”œβ”€β”€ refcoco+
β”‚   β”œβ”€β”€ refcoco
β”‚   β”œβ”€β”€ refcocog
β”‚   β”œβ”€β”€ finetune_refcoco+_train_with_mask.json 
β”‚   β”œβ”€β”€ finetune_refcoco_train_with_mask.json
β”‚   └── finetune_refcocog_val_with_mask.json
β”œβ”€β”€ textcaps
β”‚   β”œβ”€β”€ TextCaps_0.1_train.json
β”‚   └── TextCaps_0.1_val.json
β”œβ”€β”€ textvqa
β”‚   β”œβ”€β”€ train_images
β”‚   β”œβ”€β”€ TextVQA_0.5.1_train.json
β”‚   β”œβ”€β”€ textvqa_val_annotations.json
β”‚   β”œβ”€β”€ TextVQA_0.5.1_val.json
β”‚   └── textvqa_val_questions.json 
β”œβ”€β”€ vcr
β”‚   β”œβ”€β”€ vcr1images
β”‚   β”œβ”€β”€ test.jsonl
β”‚   β”œβ”€β”€ train.jsonl
β”‚   β”œβ”€β”€ val.jsonl
β”‚   └── textvqa_val_questions.json 
β”œβ”€β”€ vg
β”‚   β”œβ”€β”€ VG_100K
β”‚   β”œβ”€β”€ VG_100k_2
β”‚   β”œβ”€β”€ region_descriptions.json
β”‚   β”œβ”€β”€ image_data.json
β”‚   β”œβ”€β”€ vg_train_with_mask.json
β”‚   └── question_answers.json 
β”œβ”€β”€ visdial
β”‚   β”œβ”€β”€ VisualDialog_val2018
β”‚   β”œβ”€β”€ visdial_1.0_val_dense_annotations.json
β”‚   └── visdial_1.0_val.json  
β”œβ”€β”€ vizwiz
β”‚   β”œβ”€β”€ val
β”‚   β”œβ”€β”€ test.json
β”‚   β”œβ”€β”€ train.json
β”‚   └── val.json 
└── vqav2
    β”œβ”€β”€ v2_OpenEnded_mscoco_train2014_questions.json
    β”œβ”€β”€ v2_OpenEnded_mscoco_val2014_questions.json
    β”œβ”€β”€ v2_mscoco_train2014_annotations.json
    └── v2_mscoco_val2014_annotations.json
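
If you assemble these datasets by hand rather than with the scripts below, a small check of the top-level layout can catch missing folders before a long training run. The sketch below is only illustrative: the directory names are copied from the tree above, while DATA_ROOT and the exact subset you need depend on which training stages and benchmarks you plan to run.

```python
# check_datasets.py -- minimal layout check against the tree above (a sketch;
# trim or extend EXPECTED to the stages/benchmarks you actually use).
from pathlib import Path

DATA_ROOT = Path("datasets")  # assumed to sit next to the training scripts

EXPECTED = [
    "laion/laion_annts", "laion/laion_images",
    "mmc4/mmc4_annts", "mmc4/mmc4_images",
    "coco/annotations", "llava", "refcoco", "vg", "vqav2",
]

missing = [d for d in EXPECTED if not (DATA_ROOT / d).is_dir()]
if missing:
    print("Missing dataset directories:", ", ".join(missing))
else:
    print("Top-level dataset layout looks complete.")
```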

We provide two convenient scripts for downloading the large-scale pre-training data, LAION and MMC4. You can download the pre-training data by running the following scripts. Due to network issues, only about 40% of the data can be obtained in the end.

python ./scripts/download_laion.py --mode=annt
python ./scripts/download_mmc4.py --mode=annt
python ./scripts/download_laion.py --mode=images
python ./scripts/download_mmc4.py --mode=images

You can find all dataset downloading and conversion scripts, along with related information, in scripts/.

Note that after downloading the MMC4 dataset, you need to run the following conversion script to convert it into the pre-training format. Use the version that corresponds to the download method you chose.

python ./scripts/convert_mmc4_for_pretrain.py

You can use the robustvqa file we provide in datasets/, or regenerate it yourself with the following script:

python ./scripts/convert_imagenet_for_robustvqa_test.py

πŸ“ˆ Train


DEEM training consists of three stages: (1) image-text alignment pre-training; (2) image-text supervised fine-tuning; and (3) mask-text supervised fine-tuning.

DEEM is trained on 32 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
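
As a concrete illustration of that invariant (the numbers below are hypothetical, not the released training configuration): moving from 32 GPUs to 8 GPUs while quadrupling gradient_accumulation_steps leaves the global batch size unchanged.

```python
# Keep the global batch size constant when changing the GPU count
# (illustrative values only, not the released config).
def global_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

assert global_batch_size(4, 1, 32) == global_batch_size(4, 4, 8) == 128
```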

Please make sure you download and organize the data following Preparation before training and evaluation.

bash scripts/train.sh

πŸ“ˆ Evaluation

We perform evaluation on several image-based benchmarks. Please see our paper for more details.

To evaluate the model on these benchmarks, use the following script for automatic evaluation.

bash scripts/evaluate.sh

For GQA, you need to run the following commands separately for evaluation.

unzip -d ./uni_interleaved/utils/gqa_metrics_src/ ./uni_interleaved/utils/gqa_metrics_src/train_choices.zip
python ./uni_interleaved/utils/gqa_eval.py

πŸ‘€ Examples

We provide some examples in this section. More examples can be found in our paper.


Schedule

  • Release training & evaluation code

  • Release stage 1 image-text alignment pre-training model weights

  • Release stage 2 image-text sft model weights

  • Release stage 3 mask-text sft model weights

Citation

If you find this repo useful for your research, please consider citing our paper:

@article{luo2024deem,
  title={Deem: Diffusion models serve as the eyes of large language models for image perception},
  author={Luo, Run and Li, Yunshui and Chen, Longze and He, Wanwei and Lin, Ting-En and Liu, Ziqiang and Zhang, Lei and Song, Zikai and Xia, Xiaobo and Liu, Tongliang and others},
  journal={arXiv preprint arXiv:2405.15232},
  year={2024}
}

Acknowledgement

We would like to thank the following repos for their great work:

License

This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.