Run Luo1,2*, Yunshui Li1,2*, Longze Chen1,2*, Wanwei He1,2, Ting-En Lin5, Ziqiang Liu1,2, Lei Zhang1,2
Zikai Song6, Xiaobo Xia4, Tongliang Liu4, Min Yang1,2†, Binyuan Hui3†
* Equal contribution † Corresponding author
1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Alibaba Group
4 The University of Sydney
5 Tsinghua University
6 HUST
- [07/21] 🔥 DEEM is coming! We release the code, models, and data for DEEM!
- [07/05] 🔥 DEEM is coming! We release the paper for DEEM!
Please follow the instructions below to install the required packages.
- Clone this repository
git clone https://github.com/RainBowLuoCS/DEEM.git
- Install Package
conda create -n deem python=3.10 -y
conda activate deem
cd DEEM
pip install -r requirements.txt
# install `MultiScaleDeformableAttention` module
cd uni_interleaved/models/utils/ops
python setup.py install
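After the build finishes, a quick import check can confirm that the extension was compiled against your active environment. This is an optional sketch; it assumes the compiled module keeps the `MultiScaleDeformableAttention` name mentioned above and only verifies that it can be imported.
# Optional sanity check: verify the compiled op is importable (run inside the deem env).
import torch
import MultiScaleDeformableAttention  # built by setup.py above
print("MultiScaleDeformableAttention imported; CUDA available:", torch.cuda.is_available())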
- Download all pretrained model components from Hugging Face into the assets/ directory by running the following command:
python scripts/download_models.py
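If the script is interrupted by network issues, the same components can usually be fetched with huggingface_hub directly. The sketch below is a hypothetical fallback: the repo_id is a placeholder, not a real repository ID, and must be replaced with the IDs used in scripts/download_models.py.
# Hypothetical fallback: download one component with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<model-name>",     # placeholder; use an ID from scripts/download_models.py
    local_dir="assets/<model-name>",  # keep everything under assets/, as the script does
)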
Here are the pretrained weights on Stage 1 data only:
| Model | Diffusion Model | Base LLM | Vision Encoder | Pretrain Data | Download |
|---|---|---|---|---|---|
| MM-Interleaved-7B (Baseline) | SD 2.1 🔥 | Vicuna-7B-v1.5 | ConvNext-B | MMC4+LAION | ckpt |
| DEEM-7B | SD 2.1 | Vicuna-7B-v1.5 | ConvNext-B 🔥 | MMC4+LAION | ckpt |
| (DEEM+MM-Interleaved)-7B | SD 2.1 🔥 | Vicuna-7B-v1.5 | ConvNext-B 🔥 | MMC4+LAION | ckpt |
We provide all our fully finetuned models on Stage 2 and 3 data for DEEM:
| Model | Base LLM | Vision Encoder | Finetuning Data | Download |
|---|---|---|---|---|
| DEEM-VQA 7B | Vicuna-7B-v1.5 | ConvNext-B | LLaVA-665k+VQA+COCO | ckpt |
| DEEM-MASK 7B | Vicuna-7B-v1.5 | ConvNext-B | ReferCOCO+VG+PartData | ckpt |
Please follow MM-Interleaved, LLaVA, and Osprey to prepare the corresponding images and data.
datasets
├── laion
│   ├── laion_annts
│   │   └── laion_shard_{0...1774}_v1.json
│   └── laion_images
│       └── {00000..01174}.tar
├── mmc4
│   ├── mmc4_annts
│   │   └── docs_no_face_shard_{0..23098}_v3.jsonl
│   └── mmc4_images
│       └── b9040a0dbb22.jpg
├── aokvqa
│   └── aokvqa_v1p0_train.json
├── image2parag
│   ├── paragraphs_coco.json
│   ├── paragraphs_v1.json
│   ├── test_split.json
│   ├── train_split.json
│   └── val_split.json
├── coco
│   ├── train2014
│   ├── train2017
│   ├── val2014
│   ├── val2017
│   └── annotations
│       ├── coco_karpathy_train.json
│       ├── coco_karpathy_val.json
│       ├── captions_train2017.json
│       ├── coco_karpathy_val_gt.json
│       ├── coco_karpathy_test.json
│       ├── instances_train2017
│       └── coco_karpathy_test_gt.json
├── lncoco
│   ├── coco_train_captions.jsonl
│   └── coco_val_captions.jsonl
├── flickr30k
│   ├── flickr30k-images
│   ├── captiontobbox.json
│   ├── flickr30k_test1k.json
│   ├── phrasetobbox.json
│   └── groundedcaption.json
├── gqa
│   ├── images
│   ├── test_balanced_questions.json
│   ├── train_balanced_questions.json
│   └── testdev_balanced_questions.json
├── robustvqa
│   ├── imagenet-r
│   ├── imagenet-a
│   ├── imagenetv2
│   └── robustvqa_test.json
├── llava
│   └── llava_v1_5_mix665k.json
├── nocaps
│   ├── val_imgs
│   └── nocaps_val_4500_captions.json
├── ocr_vqa
│   ├── images
│   └── dataset.json
├── okvqa
│   ├── OpenEnded_mscoco_train2014_questions.json
│   ├── OpenEnded_mscoco_val2014_questions.json
│   ├── mscoco_train2014_annotations.json
│   └── mscoco_val2014_annotations.json
├── part_data
│   ├── test.json
│   ├── train.json
│   ├── val.json
│   ├── partImagenet
│   │   ├── train
│   │   └── partImagenet_train_format.json
│   └── pascal_part
│       ├── VOCdevkit
│       └── pascalpart_train.json
├── pope
│   ├── coco_pope_adversarial.json
│   ├── coco_pope_popular.json
│   └── coco_pope_random.json
├── refcoco
│   ├── refcoco+
│   ├── refcoco
│   ├── refcocog
│   ├── finetune_refcoco+_train_with_mask.json
│   ├── finetune_refcoco_train_with_mask.json
│   └── finetune_refcocog_val_with_mask.json
├── textcaps
│   ├── TextCaps_0.1_train.json
│   └── TextCaps_0.1_val.json
├── textvqa
│   ├── train_images
│   ├── TextVQA_0.5.1_train.json
│   ├── textvqa_val_annotations.json
│   ├── TextVQA_0.5.1_val.json
│   └── textvqa_val_questions.json
├── vcr
│   ├── vcr1images
│   ├── test.jsonl
│   ├── train.jsonl
│   ├── val.jsonl
│   └── textvqa_val_questions.json
├── vg
│   ├── VG_100K
│   ├── VG_100k_2
│   ├── region_descriptions.json
│   ├── image_data.json
│   ├── vg_train_with_mask.json
│   └── question_answers.json
├── visdial
│   ├── VisualDialog_val2018
│   ├── visdial_1.0_val_dense_annotations.json
│   └── visdial_1.0_val.json
├── vizwiz
│   ├── val
│   ├── test.json
│   ├── train.json
│   └── val.json
└── vqav2
    ├── v2_OpenEnded_mscoco_train2014_questions.json
    ├── v2_OpenEnded_mscoco_val2014_questions.json
    ├── v2_mscoco_train2014_annotations.json
    └── v2_mscoco_val2014_annotations.json
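Before training, it may help to verify that your local layout matches the tree above. The following is a minimal, optional sanity-check sketch; it only probes a few representative paths taken from the tree, so extend the list to cover the datasets you actually use.
# Minimal layout check; the paths are taken from the directory tree above.
from pathlib import Path

DATA_ROOT = Path("datasets")
expected = [
    "laion/laion_annts",
    "mmc4/mmc4_annts",
    "coco/annotations/coco_karpathy_train.json",
    "llava/llava_v1_5_mix665k.json",
    "vqav2/v2_mscoco_train2014_annotations.json",
]
missing = [p for p in expected if not (DATA_ROOT / p).exists()]
print("All expected paths found." if not missing else f"Missing: {missing}")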
We provide two convenient scripts for downloading the large-scale pre-training data, LAION and MMC4. You can download both the annotations and the images by running the following scripts. Note that, due to network issues, typically only about 40% of the data can be obtained in the end.
python ./scripts/download_laion.py --mode=annt
python ./scripts/download_mmc4.py --mode=annt
python ./scripts/download_laion.py --mode=images
python ./scripts/download_mmc4.py --mode=images
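Since only part of the data may come down successfully, you may want to check how many LAION image shards actually arrived. This is an optional sketch that assumes the {00000..01174}.tar shard naming shown in the dataset tree above.
# Count downloaded LAION image shards (assumes the {00000..01174}.tar naming above).
from pathlib import Path

shard_dir = Path("datasets/laion/laion_images")
expected = {f"{i:05d}.tar" for i in range(1175)}
present = {p.name for p in shard_dir.glob("*.tar")}
print(f"{len(present)}/{len(expected)} shards downloaded")
print("first missing shards:", sorted(expected - present)[:5])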
You can find all dataset download and conversion scripts, along with related information, in scripts/.
Note that after downloading the MMC4 dataset, you need to run the following conversion script to convert it into the pre-training format. Use the version corresponding to the download method you chose.
python ./scripts/convert_mmc4_for_pretrain.py
You can use the robustvqa file we provide in datasets, or regenerate it yourself with the following script:
python ./scripts/convert_imagenet_for_robustvqa_test.py
DEEM training consists of three stages: (1) image-text alignment pre-training; (2) image-text supervised fine-tuning; and (3) mask-text supervised fine-tuning.
DEEM is trained on 32 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
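For example (the numbers below are hypothetical and only illustrate the rule; the actual per-device settings live in the training configs):
# Keeping the global batch size fixed while changing the GPU count (hypothetical values).
def global_batch(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# 32 GPUs at batch 8 with no accumulation == 8 GPUs at batch 4 with 8 accumulation steps.
assert global_batch(8, 1, 32) == global_batch(4, 8, 8) == 256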
Please make sure you download and organize the data following Preparation before training and evaluation.
bash scripts/train.sh
We perform evaluation on several image-based benchmarks. Please see our paper for more details.
If you want to evaluate the model on image-based benchmarks, please use the evaluation scripts for automatic evaluation.
bash scripts/evaluate.sh
For GQA, you need to run the following scripts separately for evaluation:
unzip -d ./uni_interleaved/utils/gqa_metrics_src/ ./uni_interleaved/utils/gqa_metrics_src/train_choices.zip
python ./uni_interleaved/utils/gqa_eval.py
We provide some examples in this section. More examples can be found in our paper.
- Release training & evaluation code
- Release stage 1 image-text alignment pre-training model weights
- Release stage 2 image-text SFT model weights
- Release stage 3 mask-text SFT model weights
If you find this repo useful for your research, please consider citing the paper:
@article{luo2024deem,
title={Deem: Diffusion models serve as the eyes of large language models for image perception},
author={Luo, Run and Li, Yunshui and Chen, Longze and He, Wanwei and Lin, Ting-En and Liu, Ziqiang and Zhang, Lei and Song, Zikai and Xia, Xiaobo and Liu, Tongliang and others},
journal={arXiv preprint arXiv:2405.15232},
year={2024}
}
We would like to thank the following repos for their great work:
- This work is built upon MM-Interleaved.
- This work utilizes LLMs from Vicuna.
- This work utilizes the great work from OpenFlamingo, transformers, diffusers, LLaMA, CLIP, BLIP, ViT-Adapter and Osprey.
This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.