scott-mao/SliME

✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

PythonApache-2.0

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

[📖 arXiv Paper] [📊 Dataset][🏆 Models]

🔥 Update

[06/11]🔥SliME is coming! We release the paper, code, models, and data for SliME!
[06/11]🔥SliME-70B will be released soon.

👀 Contents

Install
Model
Preparation
Train
Evaluation
Examples
Citation

🔮 Install

Please follow the instructions below to install the required packages.

Clone this repository

git clone https://github.com/yfzhang114/SliME.git

Install Package

conda create -n slime python=3.10 -y
conda activate slime
cd SliME
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install ninja
pip install datasets
pip install flash-attn --no-build-isolation

🔍 Model

We provide all our fully finetuned models on Stage 1/2 and 3 data for SliME:

Model	Base LLM	Vision Encoder	Finetuning Data	Finetuning schedule	Download
SliME-7B	Vicuna-7B-v1.5	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-8B	Llama-3-8B-Instruct	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-13B	Vicuna-13B-v1.5	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-70B	Llama-3-70B-Instruct	CLIP-L	SharedGPT+SMR	Lora	ckpt

Here are the pretrained weights on Stage 1/2 data only:

Model	Base LLM	Vision Encoder	Pretrain Data	Finetuning schedule	Download
SliME-7B	Vicuna-7B-v1.5	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-8B	Llama-3-8B-Instruct	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-13B	Vicuna-13B-v1.5	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-70B	Llama-3-70B-Instruct	CLIP-L	LLaVA-Pretrain	1e	ckpt

🔮 Preparation

Dataset

Please follow LLaVA and SharedGPT4V to prepare the corresponding images and data.

SMR data structure

data
├── arxivqa
│   └── images
├── DVQA
│   └── images
├── Geometry3K
│   └── 0-2400 dirs
├── ChartQA
│   └── train_images
└── GeoQA3
│    ├── image
│    └── json
├── mathvision
├── scienceqa
├── tabmwp
└── GeoQA3
│    ├── train
│    └── test
│    └── val
└── ai2d
│    ├── abc_images
│    └── images
└── geoqa+
│   └── images

Arxiv QA Download images using this download url

python playground/data/process_arxivqa.py

DVQA

Download images using this url.

ChartQA

Clone this repo

extract all the training images in ChartQA_Dataset/train/png into ChartQA

Geometry3K

Download images using this url.

The image path in our json file will be os.path.join(f'Geometry3K/i', 'img_diagram.png')

GeoQA3

Download images using this url

extract all the training images in GeoQA3/image

MathVision

Download images using this url

Our data will not include the images from test-mini split automatically

ScienceQA

wget https://scienceqa.s3.us-west-1.amazonaws.com/images/train.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/val.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip

unzip -q train.zip
unzip -q val.zip
unzip -q test.zip

rm train.zip
rm val.zip
rm test.zip

Tabmwp

Download images using this url

TextbookQA

Download images using this url

AI2D:

Download images using this url

GeoQA+

Download images using this url

📈 Train

Click to see the detail model structure

SliME training consists of three stages: (1) training the global projector and attention adapter specifically; (2) training the local compression layer; and (3) training the full model.

SliME is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.

Please make sure you download and organize the data following Preparation before training.

If you want to train and finetune SliME, please run the following command for SliME-7B with image size 336:

bash scripts/vicuna/vicuna_7b_pt.sh
bash scripts/vicuna/vicuna_7b_sft.sh

or for SliME-8B with image size 336:

bash scripts/llama/llama3_8b_pt.sh
bash scripts/llama/llama3_8b_sft.sh

Because we reuse the pre-trained projecter weights from the SliME-7B, you can directly use the sft commands stage-3 instruction tuning by changing the PROJECTOR_DIR:

bash scripts/llama/llama3_8b_sft.sh

Please find more training scripts of in scripts/.

📈 Evaluation

We perform evaluation on several image-based benchmarks. Please see Evaluation for the detailes.

If you want to evaluate the model on image-based benchmarks, please use the scripts in scripts/MODEL_PATH/eval. For example, run the following command for TextVQA evaluation with SliME-7B:

bash scripts/llama/eval/textvqa.sh

Please find more evaluation scripts in scripts/MODEL_PATH.

👀 Examples

We provide some examples in this section. More examples can be found in our project page.

Hi-Resolution Understanding

Click to expand more examples

Citation

If you find this repo useful for your research, please consider citing the paper

@misc{zhang2024llavahd,
      title={Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models}, 
      author={Yi-Fan Zhang and Qingsong Wen and Chaoyou Fu and Xue Wang and Zhang Zhang and Liang Wang and Rong Jin},
      year={2024},
      eprint={2406.08487},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

We would like to thank the following repos for their great work:

This work is built upon the LLaVA.
This work utilizes LLMs from , Vicuna, and Llama3.

License

The data and checkpoint is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaVA, LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.