Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

This repository provides the official PyTorch implementation of the following paper:

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Qidong Huang¹,², Xiaoyi Dong²,³, Pan Zhang², Yuhang Zang², Yuhang Cao², Jiaqi Wang², Dahua Lin², Weiming Zhang¹, Nenghai Yu¹
¹University of Science and Technology of China, ²Shanghai AI Laboratory, ³The Chinese University of Hong Kong

🎯 News

[2024.10.10] 🚀 We released the paper on arXiv and HuggingFace!

[2024.10.10] 🚀 This project page has been built!

👨‍💻 Todo

Release the code of MIR
Release the training code and evaluation code of MoCa
Release the checkpoints of MoCa

⭐️ TL;DR

1. For MIR

If you just want to use MIR as the pre-training indicator of your own model, no additional environment is required.

Ensure that packages such as torch, numpy, and scipy are installed.
Replace the model preprocessing and generation in mir.py with your own model's code; LLaVA's code is included as a reference.
Specify the input args and run the command:

python mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --text_data_path PATH/TO/TEXT/DATA --image_data_path PATH/TO/VISION/DATA --eval_num 100 --mode fast

Note that base_llm is not required if you trained any part of the ViT or the base LLM during pre-training; pass it only when you pre-trained the MLP alone.

You can also adjust the args to match the initialization style of your model.
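Before running mir.py, it can help to confirm that the packages named above are importable. The following is a minimal sketch using only the Python standard library; the helper name missing_packages is hypothetical and is not part of this repository.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # torch, numpy and scipy are the packages this README mentions for mir.py.
    missing = missing_packages(["torch", "numpy", "scipy"])
    if missing:
        print("Install before running mir.py:", ", ".join(missing))
    else:
        print("The packages mentioned for mir.py appear to be installed.")
```

The check only tests whether each package can be found; it does not verify versions or GPU support.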

2. For MoCa

If you just want to use MoCa on your own model, we recommend following the steps below:

Copy the code of the MoCa module into the modeling code of your own model and ensure that the base LLM layers are equipped with MoCa in both the initialization and forward functions.
Make sure that the input preprocessing computes the modality_mask; see Lines 183-184, 269-276, and 373-382 in llava/model/llava_arch.py. Also make sure that the modality_mask is successfully passed into the model forward pass, e.g., by adding it as a formal parameter of each forward function, as in Lines 70, 88, 96, 106, 127, 137, 145, 157, 166, and 174-175 in llava/model/language_model/llava_llama.py.
Check the details needed to support use_moca=True (we recommend searching for use_moca in this repo to find the places that need revising): 1) Add it to the model config (here). 2) Add it to the training arguments (here). 3) Unlock it during training (here). 4) Ensure correct checkpoint saving (here1, here2, here3).
Add --use_moca when running the training command to enable MoCa.
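To illustrate the modality_mask step above, here is a minimal, framework-free sketch of how such a mask could be derived when an image placeholder is expanded into vision tokens. It is not the code in llava/model/llava_arch.py: the function name build_modality_mask is hypothetical, and the mask polarity (True for vision tokens) and the list-of-bools representation are assumptions. The placeholder id -200 is LLaVA's IMAGE_TOKEN_INDEX.

```python
IMAGE_TOKEN_INDEX = -200  # LLaVA's placeholder id for "<image>"


def build_modality_mask(input_ids, num_image_tokens,
                        image_token_index=IMAGE_TOKEN_INDEX):
    """Return a per-position mask after placeholder expansion.

    Each image placeholder is expanded into `num_image_tokens` positions.
    True marks a vision token, False marks a text token (assumed convention).
    """
    mask = []
    for token_id in input_ids:
        if token_id == image_token_index:
            mask.extend([True] * num_image_tokens)
        else:
            mask.append(False)
    return mask


if __name__ == "__main__":
    # Two text tokens, one image placeholder expanded to 4 positions, two text tokens.
    print(build_modality_mask([1, 3148, -200, 1724, 2], num_image_tokens=4))
```

In a real model the mask would be a tensor aligned with the expanded input embeddings, so that it can be passed through every forward function alongside them.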

📜 Setup

If you want to use our codebase (modified from LLaVA) for reproduction, we recommend building a new environment through the steps below. These steps are listed for Linux only. If you are using macOS or Windows, please refer to LLaVA.

Clone this repository and navigate to the Modality-Integration-Rate folder

git clone https://github.com/shikiw/Modality-Integration-Rate.git
cd Modality-Integration-Rate

Install Package

conda create -n llava python=3.10 -y
conda activate llava
python -m pip install --upgrade pip  # enable PEP 660 support
python -m pip install -e .
python -m pip install -e transformers-4.37.2

Install additional packages for training cases

python -m pip install -e ".[train]"
python -m pip install flash-attn --no-build-isolation

MIR

To reproduce the MIR implementation on this codebase, you can follow these steps:

Specify the text_data_path and image_data_path for MIR calculation. You can also set them as in Lines 55-64 of mir.py, which use TextVQA val images and CNN/DM text by default, i.e.,
1. Download TextVQA_0.5.1_val.json and images and extract them to PATH/TO/VISION/DATA.
2. Download CNN stories and extract them to PATH/TO/TEXT/DATA.
3. Modify Lines 55-64 with the text data path and image data path.

If you pre-train only the MLP, run this command:

python mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --eval_num 100 --mode fast

If you pre-train any part of the ViT or the base LLM, run this command:

python mir.py --model_path PATH/TO/MODEL --eval_num 100 --mode fast
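The two commands above differ only in whether --base_llm is passed. As a rough sketch, a small helper can assemble the argument list for either case; build_mir_command is a hypothetical name, and only flags that appear in this README are used.

```python
import shlex


def build_mir_command(model_path, base_llm=None, text_data_path=None,
                      image_data_path=None, eval_num=100, mode="fast"):
    """Assemble the argv list for mir.py from flags shown in this README."""
    cmd = ["python", "mir.py", "--model_path", model_path]
    if base_llm is not None:
        # Passed only when the MLP alone was pre-trained.
        cmd += ["--base_llm", base_llm]
    if text_data_path is not None:
        cmd += ["--text_data_path", text_data_path]
    if image_data_path is not None:
        cmd += ["--image_data_path", image_data_path]
    cmd += ["--eval_num", str(eval_num), "--mode", mode]
    return cmd


if __name__ == "__main__":
    # MLP-only pre-training: the base LLM path is supplied.
    print(shlex.join(build_mir_command("PATH/TO/MODEL", base_llm="PATH/TO/LLM")))
    # ViT or base LLM was (partly) trained: --base_llm is omitted.
    print(shlex.join(build_mir_command("PATH/TO/MODEL")))
```

The returned list can be handed to subprocess.run directly, which avoids shell quoting issues with paths that contain spaces.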

MoCa

Our codebase supports --use_moca to activate the implementation of MoCa. Check out scripts/v1_5/pre_sft_moca.sh for more details.

Model	Size	Schedule	Average	MMStar	MME	MMB	MMB-CN	SEED-IMG	TextVQA	MM-Vet	POPE	GQA
LLaVA-v1.5	7B	full_ft-1e	59.1	30.3	1510.7	64.3	58.3	66.1	58.2	31.1	85.9	62.0
+MoCa	7B	full_ft-1e	60.6	36.5	1481.0	66.8	60.0	67.0	58.7	32.2	86.9	62.8
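The README does not state how the Average column is computed. One reading that is consistent with the reported numbers is a plain mean over the nine benchmarks with MME divided by 20 to bring it onto a 0-100 scale. The sketch below checks that reading against the table; the divisor is an assumption, not something confirmed by this repository.

```python
# Scores copied from the table above; MME is kept on its native scale here.
SCORES = {
    "LLaVA-v1.5": {"MMStar": 30.3, "MME": 1510.7, "MMB": 64.3, "MMB-CN": 58.3,
                   "SEED-IMG": 66.1, "TextVQA": 58.2, "MM-Vet": 31.1,
                   "POPE": 85.9, "GQA": 62.0},
    "+MoCa": {"MMStar": 36.5, "MME": 1481.0, "MMB": 66.8, "MMB-CN": 60.0,
              "SEED-IMG": 67.0, "TextVQA": 58.7, "MM-Vet": 32.2,
              "POPE": 86.9, "GQA": 62.8},
}

MME_DIVISOR = 20.0  # assumption: MME is rescaled to 0-100 before averaging


def average_score(scores, mme_divisor=MME_DIVISOR):
    """Mean over all benchmarks, with MME rescaled by `mme_divisor`."""
    values = [value / mme_divisor if name == "MME" else value
              for name, value in scores.items()]
    return sum(values) / len(values)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}: {average_score(scores):.2f}")
```

Under this assumption the means come out at roughly 59.08 and 60.55, in line with the reported 59.1 and 60.6.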

The pretrained and finetuned checkpoints are released.

Train

This codebase is based on LLaVA and ShareGPT4V, with some new features added. It supports the following inputs in the launch script:

--tune_vision_tower and --tune_vit_from_layer
--tune_language_model and --tune_llm_utill_layer
--tune_entire_model
--data_scale
--use_moca and --moca_std

Some cases for reference:

To pre-train the model with a customized data scale (e.g., 200K):

sh scripts/v1_5/pre_data_scale.sh

To pre-train the model (unlocking layers 13-24 of the ViT and layers 1-16 of the base LLM) and then run SFT (unlocking the entire LLM by default):

sh scripts/v1_5/pre_unlock_vit-12_llm-16_sft.sh

To pre-train the model (unlocking layers 13-24 of the ViT and the entire base LLM) and then run SFT (unlocking the entire LLM by default):

sh scripts/v1_5/pre_unlock_vit-12_llm-all_sft.sh
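The script names and descriptions above suggest how --tune_vit_from_layer and --tune_llm_utill_layer map to unlocked layers: vit-12 corresponds to ViT layers 13-24, and llm-16 to LLM layers 1-16. The sketch below spells out that reading with 1-based layer numbers. The helper names are hypothetical and the exact semantics of the flags in the training code are an assumption inferred from this README.

```python
def unlocked_vit_layers(num_layers, tune_vit_from_layer):
    """1-based ViT layers assumed trainable: every layer after the given index."""
    return list(range(tune_vit_from_layer + 1, num_layers + 1))


def unlocked_llm_layers(num_layers, tune_llm_utill_layer):
    """1-based LLM layers assumed trainable: the first `tune_llm_utill_layer` layers."""
    return list(range(1, min(tune_llm_utill_layer, num_layers) + 1))


if __name__ == "__main__":
    # A 24-layer ViT with the value 12: layers 13 through 24 are unlocked.
    print(unlocked_vit_layers(24, 12))
    # A 32-layer LLM with the value 16: layers 1 through 16 are unlocked.
    print(unlocked_llm_layers(32, 16))
```

Checking the printed ranges against the parameters that actually have requires_grad set in your run is a quick way to confirm the flags behave as expected.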

To apply MoCa in training:

sh scripts/v1_5/pre_sft_moca.sh

Evaluation

We follow the original evaluation in LLaVA for most benchmarks. For MMStar, we use VLMEvalKit.

See Evaluation.md.

Acknowledgement

This repo is based on the codebases of LLaVA and ShareGPT4V. Thanks to the authors for their impressive work!

Citation

If you find this work useful for your research, please cite our paper:

@article{huang2024deciphering,
  title={Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate},
  author={Huang, Qidong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Wang, Jiaqi and Lin, Dahua and Zhang, Weiming and Yu, Nenghai},
  journal={arXiv preprint arXiv:2410.07167},
  year={2024}
}