
🔥🔥🔥 Code for "Empowering Multimodal Large Language Models with Evol-Instruct"


MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct


Run Luo1,2*, Haonan Zhang3*, Longze Chen1,2*, Ting-En Lin3*,
Xiong Liu3, Yuchuan Wu3, Min Yang1,2🌟, Yongbin Li3🌟,
Minzheng Wang2, Pengpeng Zeng4, Lianli Gao5, Heng Tao Shen4,
Yunshui Li1,2, Xiaobo Xia6, Fei Huang3, Jingkuan Song4🌟

* Equal contribution 🌟 Corresponding author

1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Alibaba Group 4 Tongji University 5 Independent Researcher 6 The University of Sydney


[📖 arXiv Paper] [📊 Dataset] [🏆 Models]
MMEvol is a pioneering method that successfully brings Evol-Instruct to the multimodal domain, increasing the diversity and complexity of multimodal instruction data. Unlike previous methods such as VILA2, MIMIC-IT, and MMInstruct, it performs iterative evolution in a simple, fully automated manner, going beyond earlier limits on data complexity and diversity. MMEvol places no restrictions on data format or task type and requires no intricate processing, so a limited set of image instruction data can be evolved rapidly and iteratively into high-quality multimodal data, which in turn strengthens multimodal models. It can also be combined with other data-flow-driven methods such as VILA2, MIMIC-IT, and MMInstruct for more robust data construction. We invite everyone to try it now!

🔥 Update

  • [11/10] 🔥 MMEvol is coming! We release the code, models, and data for MMEvol!
  • [09/09] 🔥 MMEvol is coming! We release the paper for MMEvol!

📷 Setup

Please follow the instructions below to install the required packages.

  1. Clone this repository
git clone https://github.com/RainBowLuoCS/MMEvol.git
cd MMEvol
  2. Install the package
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
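
After installation, a quick sanity check can confirm that the key packages are importable. The helper below is only a sketch and is not part of the repository; the module names `torch`, `llava`, and `flash_attn` are assumptions about what the steps above install, so adjust them to your environment.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Assumed module names; change them to match your actual install.
    wanted = ["torch", "llava", "flash_attn"]
    missing = missing_modules(wanted)
    print("missing:", missing if missing else "none")
```

An empty result suggests the editable install and the training extras resolved correctly; it does not verify versions or CUDA compatibility.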

📷 Hyperparameters

The hyperparameters used for both pretraining (PT) and finetuning (FT) are provided below.

| Hyperparameter | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|---|---|
| PT | 256 | 0 | 1e-3 | 0 | 1 | 4096 | 0 |
| FT | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
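
When adapting the training scripts to different hardware, the global batch size usually has to be preserved through gradient accumulation. The sketch below assumes the common relation global batch = per-device batch × number of devices × accumulation steps; the device count and per-device batch size are hypothetical values, not settings taken from this repository.

```python
def grad_accum_steps(global_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Gradient-accumulation steps needed to reach global_batch.

    Assumes global_batch = per_device_batch * num_devices * accumulation_steps.
    """
    per_step = per_device_batch * num_devices
    if per_step <= 0 or global_batch % per_step != 0:
        raise ValueError(
            "global batch must be a positive multiple of per_device_batch * num_devices"
        )
    return global_batch // per_step


# Hypothetical hardware: 8 devices, 4 samples per device.
print(grad_accum_steps(256, 4, 8))  # PT global batch size from the table -> 8
print(grad_accum_steps(128, 4, 8))  # FT global batch size from the table -> 4
```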

Model

Here are the pretrained weights and instruction tuning weights.

| Model | Pretrained Projector | Base LLM | PT Data | IT Data | Download |
|---|---|---|---|---|---|
| MMEvol-Qwen2-7B | mm_projector | Qwen2-7B | LLaVA-Pretrain | MMEvol | ckpt |
| MMEvol-LLaMA3-8B | mm_projector | LLaMA3-8B | LLaVA-Pretrain | MMEvol | ckpt |

Performance

Supported by VLMEvalKit (OpenCompass)

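| Model | MME_C | MMStar | HallBench | MathVista_mini | MMMU_val | AI2D | POPE | BLINK | RWQA |
|---|---|---|---|---|---|---|---|---|---|
| MMEvol-LLaMA3-8B | 47.8 | 50.1 | 62.3 | 50.0 | 40.8 | 73.9 | 86.8 | 46.4 | 62.6 |
| MMEvol-Qwen2-7B | 55.8 | 51.6 | 64.1 | 52.4 | 45.1 | 74.7 | 87.8 | 47.7 | 63.9 |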

Not supported by VLMEvalKit (VQADataSet)

| Model | VQA_v2 | GQA | MIA | MMSInst |
|---|---|---|---|---|
| MMEvol-LLaMA3-8B | 83.4 | 65.0 | 78.8 | 32.3 |
| MMEvol-Qwen2-7B | 83.1 | 65.5 | 77.6 | 41.8 |

💡 Preparation

Dataset

Please follow LLaVA to prepare the corresponding images and data.

data structure

datasets
├── json
│   ├── allava_vflan.json
│   ├── arxivqa.json
│   ├── cambrain_math_code.json
│   ├── data_engine.json
│   ├── shargpt_40k.json
│   ├── tabmwp.json
│   ├── wizardlm_143k.json
│   ├── mmevol_seed_no_evol_163k.json
│   ├── mmevol_evol_480k.json
│   └── mix_evol_sft.json
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
├── alfword
│   ├── alf-image-id-0
│   ├── alf-image-id-1
│   ├── alf-image-id-2
│   ├── alf-image-id-3
│   └── alf-image-id-4
├── allava_vflan
│   └── images
├── arxivqa
│   └── images
├── chartqa
│   ├── test
│   ├── train
│   └── val
├── coco
│   ├── train2014
│   ├── train2017
│   ├── val2014
│   └── val2017
├── clevr
│   ├── CLEVR_GoGenT_v1.0
│   └── CLEVR_v1.0
├── data_engine
│   ├── partI
│   ├── partII
│   └── partIII
├── design2code
│   └── images
├── docvqa
│   ├── test
│   ├── train
│   └── val
├── dvqa
│   └── images
├── geo170k
│   ├── images/geo3k
│   └── images/geoqa_plus
├── geoqa+
│   └── images
├── gpt4v-dataset
│   └── images
├── gqa
│   └── images
├── hfdata
│   └── ....
├── llava
│   └── llava_pretrain/images
├── llavar
│   └── finetune
├── mathvision
│   └── images
├── ocr_vqa
│   └── images
├── Q-Instruct-DB
│   ├── livefb_liveitw_aigc
│   └── spqa_koniq
├── sam
│   └── images
├── scienceqa
│   └── images
├── share_textvqa
│   └── images
├── synthdog-en
│   └── images
├── tabmwp
│   └── tables
├── textbookqa
│   └── tqa_train_val_test
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── vizwiz
│   └── train
├── web-celebrity
│   └── images
├── web-landmark
│   └── images
└── wikiart
    └── images
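
Before launching training, it can help to verify that the directory layout matches the tree. The following is a minimal sketch, not a script shipped with the repository; it checks only a handful of sub-paths copied from the tree, and the root name `datasets` is taken from the tree as well.

```python
from pathlib import Path

# A few sub-paths taken from the tree; extend the list as needed.
EXPECTED = [
    "json",
    "coco/train2017",
    "gqa/images",
    "vg/VG_100K",
    "llava/llava_pretrain/images",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected sub-paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    # "datasets" is the root name used in the tree; change it if yours differs.
    for p in missing_paths("datasets"):
        print("missing:", p)
```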

mmevol_evol_480k.json is the 480K evolved dataset derived from the seed data mmevol_seed_no_evol_163k.json. For instruction tuning (IT), you can freely combine it with other data such as allava_vflan.json according to your preferences, or directly use our mixed mix_evol_sft.json.
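
Combining annotation files into a custom mix could look roughly like the sketch below. It assumes each file holds a single JSON array of sample records (the LLaVA-style layout); that assumption, the function name `mix_annotations`, and the output name `my_mix.json` are illustrative and not confirmed by the repository.

```python
import json
import random
from pathlib import Path


def mix_annotations(paths, out_path, seed=0):
    """Concatenate several JSON-array annotation files, shuffle, and write one file."""
    merged = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{p} does not contain a JSON array")
        merged.extend(records)
    random.Random(seed).shuffle(merged)  # fixed seed keeps the mix reproducible
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)


if __name__ == "__main__":
    # Input names come from the tree above; my_mix.json is a hypothetical output.
    src = ["datasets/json/mmevol_evol_480k.json", "datasets/json/allava_vflan.json"]
    if all(Path(p).exists() for p in src):
        print(mix_annotations(src, "datasets/json/my_mix.json"), "samples written")
```

The resulting file would then be referenced as the data path in the finetuning script in place of mix_evol_sft.json.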

📈 Train

Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here and organize the data following Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).

bash scripts/v1_6/train/llama3/pretrain.sh
bash scripts/v1_6/train/qwen2/pretrain.sh

Visual Instruction Tuning

Please make sure you download and organize the data following Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).

bash scripts/v1_6/train/llama3/finetune.sh
bash scripts/v1_6/train/qwen2/finetune.sh

📈 Evaluation

Ensure that your api_base and key are correctly configured before evaluation.

opencompass

First, enter the vlmevalkit directory and install all dependencies:

cd vlmevalkit
pip install -r requirements.txt

Then, run script/run_inference.sh, which receives three input parameters in sequence: MODELNAME, DATALIST, and MODE. MODELNAME represents the name of the model, DATALIST represents the datasets used for inference, and MODE represents evaluation mode:

chmod +x ./script/run_inference.sh
./script/run_inference.sh $MODELNAME $DATALIST $MODE

The two available choices for MODELNAME are listed in vlmeval/config.py:

ungrouped = {
    'MMEvol-Llama3-V-1_6': partial(LLaVA_Llama3_V, model_path="checkpoints/xxx/checkpoint-14000"),
    'MMEvol-Qwen2-V-1_6': partial(LLaVA_Qwen2_V, model_path="checkpoints/xxx/checkpoint-14000"),
}
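
Each entry in this dict pairs a model name with a `functools.partial` that pre-binds the constructor arguments, so a model is only instantiated when its name is looked up and called. The sketch below illustrates that pattern in isolation; `DummyModel`, `registry`, and `build` are hypothetical stand-ins, not the VLMEvalKit classes or API.

```python
from functools import partial


class DummyModel:
    """Hypothetical stand-in for a model wrapper such as LLaVA_Llama3_V."""

    def __init__(self, model_path):
        self.model_path = model_path


# Same shape as the `ungrouped` dict: name -> constructor with arguments pre-bound.
registry = {
    "demo-model": partial(DummyModel, model_path="checkpoints/demo"),
}


def build(name):
    """Look up a model name and instantiate it; nothing is created before this call."""
    if name not in registry:
        raise KeyError(f"unknown model {name!r}; choices: {sorted(registry)}")
    return registry[name]()


print(build("demo-model").model_path)  # checkpoints/demo
```

In practice this means pointing `model_path` at your own checkpoint directory is the only change needed to evaluate a newly trained model under an existing name.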

All available choices for DATALIST are listed in vlmeval/utils/dataset_config.py. While evaluating on a single dataset, call the dataset name directly without quotation marks; while evaluating on multiple datasets, separate the names of different datasets with spaces and add quotation marks at both ends:

DATALIST="MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK"
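
If the script is driven from Python rather than typed by hand, the quoting rule can be handled by passing the dataset list as one argument. This is a sketch under assumptions: `inference_command` is a hypothetical helper, and the only facts taken from this README are the script path, the argument order, and the `all` / `infer` mode values.

```python
import shlex


def inference_command(model_name, datasets, mode="all"):
    """Build the argv for script/run_inference.sh; datasets travel as ONE argument."""
    if mode not in {"all", "infer"}:
        raise ValueError("mode must be 'all' or 'infer'")
    return ["./script/run_inference.sh", model_name, " ".join(datasets), mode]


argv = inference_command("MMEvol-Qwen2-V-1_6", ["MME", "POPE"])
print(shlex.join(argv))  # shell-quoted form, suitable for copy-paste
```

Passing `argv` as a list to `subprocess.run` avoids shell quoting altogether, since the joined dataset names already form a single argument.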

To run inference and score each benchmark in one pass, set MODE=all. If only inference results are required, set MODE=infer. To reproduce the results in the Performance table (columns from MME to RealWorldQA), run the script with the following settings:

# run on all 9 datasets
./script/run_inference.sh MMEvol-Llama3-V-1_6 "MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK" all

# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh MMEvol-Llama3-V-1_6 MME all
# MMMU_DEV_VAL
./script/run_inference.sh MMEvol-Llama3-V-1_6 MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh MMEvol-Llama3-V-1_6 MathVista_MINI all
.....

# NOTE: BLINK should be evaluated separately with llava/eval/blink_eval.py.
python llava/eval/blink_eval.py

vqadataset

For the VQA and GQA datasets, please follow LLaVA for evaluation.

For MIA and MMSInst, first download the datasets and then run the following scripts for evaluation:

cd mmevol
# test
python llava/eval/model_vqa_mia.py
python llava/eval/model_vqa_mminst.py
# eval
python llava/eval/mia_eval.py
python llava/eval/mminst_eval.py

👀 Visualization

Tongyi-ConvAI generated this dataset for multimodal supervised fine-tuning. It was used to train the Evol-Llama3-8B-Instruct and Evol-Qwen2-7B models reported in our paper. To create it, we first selected a 163K seed instruction-tuning dataset for Evol-Instruct, then improved data quality through an iterative process that involves a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution. The result is a more complex and diverse image-text instruction dataset, which in turn gives MLLMs stronger capabilities. Below we show the detailed data distribution of SEED-163K, the seed set prepared for the multi-round evolution described above. More details can be found in our paper.

Schedule

  • Release MMEvol-10M
  • Release training & evaluation code
  • Release model weight
  • Release evolved dataset MMEvol-480K

Citation

If you find this repo useful for your research, please consider citing the paper

@article{luo2024mmevol,
  title={Mmevol: Empowering multimodal large language models with evol-instruct},
  author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
  journal={arXiv preprint arXiv:2409.05840},
  year={2024}
}

Contact

If you have any questions, please reach out to the following contacts for help.

Acknowledgement

- LLaVA: the codebase we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use LLaVA-NeXT.

- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!