- Clone this repository and navigate to the XmodelVLM folder

  ```bash
  git clone https://github.com/XiaoduoAILab/XmodelVLM.git
  cd XmodelVLM
  ```
- Install Package

  ```bash
  conda create -n xmodelvlm python=3.10 -y
  conda activate xmodelvlm
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Run inference

  ```bash
  python inference.py --model-path path/to/folder
  ```

  Tip: make sure you are using the latest code and the matching virtual environment, and that the checkpoint, vision encoder, and related file paths are set in `config.json`.
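For reference, here is a minimal sketch of the `config.json` entries involved. The field names follow LLaVA-style configs and are assumptions, not the released schema; check the checkpoint's own `config.json` for the exact keys:

```json
{
  "model_type": "xmodelvlm",
  "mm_vision_tower": "path/to/clip-vit-large-patch14",
  "mm_projector_type": "xdp"
}
```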
The overall architecture of our network closely mirrors that of LLaVA-1.5, as shown in Figure 3. It consists of three key components:

- a vision encoder (CLIP ViT-L/14)
- a lightweight language model (Xmodel_LM-1.1B)
- a projector (XDP) responsible for aligning the visual and textual spaces, as shown in Figure 4
Refer to our paper for more details!
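For intuition, here is a minimal PyTorch sketch of this LLaVA-1.5-style pipeline. It is illustrative only: the module names are assumptions, and the two-layer MLP merely stands in for XDP, whose actual design is described in the paper.

```python
import torch
import torch.nn as nn

class XmodelVLMSketch(nn.Module):
    """Hypothetical sketch: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. CLIP ViT-L/14
        self.llm = llm                        # e.g. Xmodel_LM-1.1B
        # Stand-in for XDP: maps vision features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images, text_embeds):
        patch_feats = self.vision_encoder(images)    # (B, N_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)  # (B, N_patches, llm_dim)
        # Visual tokens are prepended to the text embeddings, LLaVA-style.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```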
The training process of Xmodel_VLM is divided into two stages, as shown in Figure 5:

- stage I: pre-training
  - ❄️ frozen vision encoder + 🔥 learnable XDP projector + ❄️ frozen LLM
- stage II: multi-task training
  - ❄️ frozen vision encoder + 🔥 learnable XDP projector + 🔥 learnable LLM
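A minimal PyTorch sketch of the stage I trainability scheme (illustrative only; the attribute names below are assumptions, and the released training scripts handle this through their own configs):

```python
def set_stage_one_trainability(model):
    """Stage I: train only the projector; keep everything else frozen."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False  # ❄️ frozen vision encoder
    for p in model.llm.parameters():
        p.requires_grad = False  # ❄️ frozen LLM
    for p in model.projector.parameters():
        p.requires_grad = True   # 🔥 learnable XDP projector
```

In stage II, the LLM's parameters would be unfrozen as well.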
First, download the Xmodel_VLM checkpoints from the Hugging Face website, and prepare the vision encoder (e.g., CLIP).
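For example, using the `huggingface_hub` Python API (the repo id below is a placeholder assumption; check the Hugging Face model page for the actual one):

```python
from huggingface_hub import snapshot_download

# Repo id is an assumption, not confirmed by this README.
snapshot_download(repo_id="XiaoduoAILab/Xmodel_VLM", local_dir="./Xmodel_VLM")
```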
- Prepare benchmark data

We evaluate models on a diverse set of 9 benchmarks: GQA, MMBench, MMBench-CN, MME, POPE, SQA, TextVQA, VizWiz, and MM-Vet. Follow these instructions to set up the datasets:
Data Download Instructions
- Download some useful data/scripts pre-collected by us:

  ```bash
  unzip benchmark_data.zip && cd benchmark_data
  bmk_dir=${work_dir}/data/benchmark_data
  ```
- gqa
  - download its image data following the official instructions here

    ```bash
    cd ${bmk_dir}/gqa && ln -s /path/to/gqa/images images
    ```
- mme
  - download the data following the official instructions here

    ```bash
    cd ${bmk_dir}/mme && ln -s /path/to/MME/MME_Benchmark_release_version images
    ```
- pope
  - download COCO from POPE following the official instructions here

    ```bash
    cd ${bmk_dir}/pope && ln -s /path/to/pope/coco coco && ln -s /path/to/coco/val2014 val2014
    ```
- sqa
  - download images from the `data/scienceqa` folder of the ScienceQA repo

    ```bash
    cd ${bmk_dir}/sqa && ln -s /path/to/sqa/images images
    ```
- textvqa
  - download images following the instructions here

    ```bash
    cd ${bmk_dir}/textvqa && ln -s /path/to/textvqa/train_images train_images
    ```
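After the symlinks are in place, the benchmark directory should look roughly like this (a sketch based on the paths above; actual contents depend on each download):

```
benchmark_data/
├── gqa/images            -> /path/to/gqa/images
├── mme/images            -> /path/to/MME/MME_Benchmark_release_version
├── pope/coco             -> /path/to/pope/coco
├── pope/val2014          -> /path/to/coco/val2014
├── sqa/images            -> /path/to/sqa/images
└── textvqa/train_images  -> /path/to/textvqa/train_images
```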
We provide detailed pre-training, fine-tuning, and testing shell scripts (you only need to modify the corresponding model and data paths), for example:
```bash
bash scripts/pretrain.sh 0,1,2,3  # run on GPUs 0,1,2,3
```
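The first argument selects the GPUs to run on. A typical way such a script consumes it (an assumption here; check `scripts/pretrain.sh` for the actual handling) is:

```bash
GPUS=${1:-0}                       # comma-separated GPU ids, default GPU 0
export CUDA_VISIBLE_DEVICES=$GPUS  # restrict training to the selected GPUs
```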
If you find Xmodel_VLM useful in your research or applications, please consider giving it a star ⭐ and citing it using the following BibTeX:
```bibtex
@misc{xu2024xmodelvlm,
      title={Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model},
      author={Wanting Xu and Yang Liu and Langping He and Xucheng Huang and Ling Jiang},
      year={2024},
      eprint={2405.09215},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```