- Clone this repository and navigate to the XmodelVLM folder

  ```bash
  git clone https://github.com/XiaoduoAILab/XmodelVLM.git
  cd XmodelVLM
  ```
- Install Package

  ```bash
  conda create -n xmodelvlm python=3.10 -y
  conda activate xmodelvlm
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Run inference

  ```bash
  python inference.py --model-path path/to/folder
  ```

  Tip: make sure you are using the latest code and the matching virtual environment, and that the checkpoint, vision encoder, and related file paths are set in `config.json`.
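For reference, here is a minimal sketch of the `config.json` entries involved. The field names follow LLaVA-style configs and are assumptions, not the released schema; check the checkpoint's own `config.json` for the exact keys:

```json
{
  "model_type": "xmodelvlm",
  "mm_vision_tower": "path/to/clip-vit-large-patch14",
  "mm_projector_type": "xdp"
}
```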
The overall architecture of our network closely mirrors that of LLaVA-1.5, as shown in Figure 3. It consists of three key components:

- a vision encoder (CLIP ViT-L/14)
- a lightweight language model (Xmodel_LM-1.1B)
- a projector (XDP) responsible for aligning the visual and textual spaces, as shown in Figure 4
Refer to our paper for more details!
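For intuition, here is a minimal PyTorch sketch of this LLaVA-1.5-style pipeline. It is illustrative only: the module names are assumptions, and the two-layer MLP merely stands in for XDP, whose actual design is described in the paper.

```python
import torch
import torch.nn as nn

class XmodelVLMSketch(nn.Module):
    """Hypothetical sketch: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. CLIP ViT-L/14
        self.llm = llm                        # e.g. Xmodel_LM-1.1B
        # Stand-in for XDP: maps vision features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images, text_embeds):
        patch_feats = self.vision_encoder(images)    # (B, N_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)  # (B, N_patches, llm_dim)
        # Visual tokens are prepended to the text embeddings, LLaVA-style.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```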
The training process of Xmodel_VLM is divided into two stages, as shown in Figure 5:

- stage I: pre-training
  - ❄️ frozen vision encoder + 🔥 learnable XDP projector + ❄️ frozen LLM
- stage II: multi-task training
  - ❄️ frozen vision encoder + 🔥 learnable XDP projector + 🔥 learnable LLM
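A minimal PyTorch sketch of the stage I trainability scheme (illustrative only; the attribute names below are assumptions, and the released training scripts handle this through their own configs):

```python
def set_stage_one_trainability(model):
    """Stage I: train only the projector; keep everything else frozen."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False  # ❄️ frozen vision encoder
    for p in model.llm.parameters():
        p.requires_grad = False  # ❄️ frozen LLM
    for p in model.projector.parameters():
        p.requires_grad = True   # 🔥 learnable XDP projector
```

In stage II, the LLM's parameters would be unfrozen as well.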
First, download the Xmodel_VLM checkpoints from the Hugging Face website, and prepare the vision encoder (e.g., CLIP).
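For example, using the `huggingface_hub` Python API (the repo id below is a placeholder assumption; check the Hugging Face model page for the actual one):

```python
from huggingface_hub import snapshot_download

# Repo id is an assumption, not confirmed by this README.
snapshot_download(repo_id="XiaoduoAILab/Xmodel_VLM", local_dir="./Xmodel_VLM")
```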
- Prepare benchmark data

We evaluate models on a diverse set of 9 benchmarks: GQA, MMBench, MMBench-CN, MME, POPE, SQA, TextVQA, VizWiz, and MM-Vet. Follow these instructions to set up the datasets:
Data Download Instructions
- Download some useful data/scripts pre-collected by us:

  ```bash
  unzip benchmark_data.zip && cd benchmark_data
  bmk_dir=${work_dir}/data/benchmark_data
  ```
- gqa
  - download its image data following the official instructions here

    ```bash
    cd ${bmk_dir}/gqa && ln -s /path/to/gqa/images images
    ```
- mme
  - download the data following the official instructions here

    ```bash
    cd ${bmk_dir}/mme && ln -s /path/to/MME/MME_Benchmark_release_version images
    ```
- pope
  - download COCO from POPE following the official instructions here

    ```bash
    cd ${bmk_dir}/pope && ln -s /path/to/pope/coco coco && ln -s /path/to/coco/val2014 val2014
    ```
- sqa
  - download images from the `data/scienceqa` folder of the ScienceQA repo

    ```bash
    cd ${bmk_dir}/sqa && ln -s /path/to/sqa/images images
    ```
- textvqa
  - download images following the instructions here

    ```bash
    cd ${bmk_dir}/textvqa && ln -s /path/to/textvqa/train_images train_images
    ```
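After the symlinks are in place, the benchmark directory should look roughly like this (a sketch based on the paths above; actual contents depend on each download):

```
benchmark_data/
├── gqa/images            -> /path/to/gqa/images
├── mme/images            -> /path/to/MME/MME_Benchmark_release_version
├── pope/coco             -> /path/to/pope/coco
├── pope/val2014          -> /path/to/coco/val2014
├── sqa/images            -> /path/to/sqa/images
└── textvqa/train_images  -> /path/to/textvqa/train_images
```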
We provide detailed pre-training, fine-tuning, and testing shell scripts (you only need to modify the corresponding model and data paths), for example:
```bash
bash scripts/pretrain.sh 0,1,2,3  # run on GPUs 0,1,2,3
```
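The first argument selects the GPUs to run on. A typical way such a script consumes it (an assumption here; check `scripts/pretrain.sh` for the actual handling) is:

```bash
GPUS=${1:-0}                       # comma-separated GPU ids, default GPU 0
export CUDA_VISIBLE_DEVICES=$GPUS  # restrict training to the selected GPUs
```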
If you find Xmodel_VLM useful in your research or applications, please consider giving it a star ⭐ and citing it using the following BibTeX:
```bibtex
@misc{xu2024xmodelvlm,
      title={Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model},
      author={Wanting Xu and Yang Liu and Langping He and Xucheng Huang and Ling Jiang},
      year={2024},
      eprint={2405.09215},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```