/llava-phi


  • LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
    arXiv

  • Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
    arXiv

📸 Release

  • Mar. 23rd, 2024: Our model 🔥🔥🔥 Mipha-3B and the corresponding training code are released.
  • Jan. 26th, 2024: You can now download our model weights.
  • Jan. 15th, 2024: Our model and training code are released.
  • Jan. 5th, 2024: Our code is currently undergoing an internal review and will be released shortly (expected next week).

Model Zoo

Mipha & LLaVA-Phi

Model          LLM            VQAv2   GQA    SQA^I   VQA^T   POPE   MME^P    MMB
LLaVA-Phi-3B   Phi-2-2.7B     71.4    -      68.4    48.6    85.0   1335.1   59.8
Mipha-1.6B     Phi-1.5-1.3B   77.5    62.7   58.3    45.6    86.9   1203.1   57.7
Mipha-2.4B     Gemma-2B       79.5    63.3   65.3    52.4    86.6   1397.1   59.4
Mipha-3B       Phi-2-2.7B     81.3    63.9   70.9    56.6    86.7   1488.9   69.7

Install

  1. Clone this repository and navigate to the Mipha folder
git clone https://github.com/zhuyiche/Mipha.git
cd Mipha
  2. Install the package
conda create -n mipha python=3.10 -y
conda activate mipha
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
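
After the editable install, a quick sanity check can confirm that the environment is usable. The sketch below is only an illustration: it assumes the installed package is importable as `mipha` (inferred from the `python -m mipha.serve.cli` command later in this README) and that Python 3.10 or newer is acceptable, which the README does not state explicitly.

```python
import importlib.util
import sys

# Assumption: the conda command above pins Python 3.10; treating it as a
# minimum version is a guess, not something the README states.
REQUIRED_PYTHON = (3, 10)
# Assumption: the package name is inferred from `python -m mipha.serve.cli`.
PACKAGE_NAME = "mipha"


def python_ok(minimum=REQUIRED_PYTHON):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


def is_importable(name):
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print(f"Python >= {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]}: {python_ok()}")
    print(f"'{PACKAGE_NAME}' importable: {is_importable(PACKAGE_NAME)}")
```

Run it inside the activated `mipha` environment; if the package is reported as not importable, re-run `pip install -e .` from the repository root.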

Mipha Weights

Download Mipha-3B from Hugging Face.

Train

Mipha training consists of two stages: (1) feature alignment stage: use the LLaVA-1.5 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following examples, plus around 515K VQA examples from academic-oriented tasks, to teach the model to follow multimodal instructions.

Hyperparameters

The hyperparameters used in pretraining and finetuning are provided below.

  1. Pretraining
Hyperparameter   Global Batch Size   Learning rate   Epochs   Max length   Weight decay
Mipha            256                 1e-3            1        2048         0
  2. Finetuning
Hyperparameter   Global Batch Size   Learning rate   Epochs   Max length   Weight decay
Mipha            128                 2e-5            2        2048         0
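
To get a feel for what these settings imply, the following sketch estimates the number of optimizer steps per stage. It is a rough calculation, not taken from the training scripts: the dataset sizes are the approximate figures quoted above (558K, and 150K plus about 515K), the last partial batch is assumed to be kept, and the GPU count and per-device batch size in the example are hypothetical.

```python
import math


def global_batch_size(per_device_batch, num_devices, grad_accum_steps=1):
    """Effective batch size per optimizer step across all devices."""
    return per_device_batch * num_devices * grad_accum_steps


def optimizer_steps(num_samples, global_batch, epochs=1):
    """Optimizer steps, assuming the last partial batch of each epoch is kept."""
    return math.ceil(num_samples / global_batch) * epochs


# Approximate dataset sizes quoted in this README.
PRETRAIN_SAMPLES = 558_000
FINETUNE_SAMPLES = 150_000 + 515_000  # about 665K in total

pretrain_steps = optimizer_steps(PRETRAIN_SAMPLES, global_batch=256, epochs=1)
finetune_steps = optimizer_steps(FINETUNE_SAMPLES, global_batch=128, epochs=2)

if __name__ == "__main__":
    # Hypothetical hardware: 8 devices with 32 samples each gives 256.
    print("example global batch:", global_batch_size(32, 8))
    print("pretraining steps (approx.):", pretrain_steps)   # 2180
    print("finetuning steps (approx.):", finetune_steps)    # 10392
```

If you train on fewer devices, one common way to keep the global batch size unchanged is to raise the gradient accumulation steps so that the product of the three factors stays the same.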

Download base checkpoints

Our base model is phi-2. Download the weights from here, then update --model_name_or_path in get_base_model.sh accordingly.
Our vision encoder is SigLIP-SO (0.4B). Download the weights from here.

Integrate the model

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here.

Then, you should integrate phi-2 and SigLIP-SO into a single model by running the following script:

bash ./scripts/mipha/get_base_model.sh

Pretrain (feature alignment)

bash ./scripts/mipha/pretrain.sh

Visual Instruction Tuning

Please refer here to prepare the instruction tuning data.

Training script with DeepSpeed ZeRO-3: finetune.sh.

bash ./scripts/mipha/finetune.sh
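
The three training steps above run in a fixed order: integrate the base model, pretrain, then finetune. A small driver such as the sketch below can make that order explicit and stop at the first failure. It is a convenience wrapper, not part of the repository, and it assumes all three scripts live under `./scripts/mipha/` and that you run it from the repository root; adjust the paths if your checkout differs.

```python
import subprocess

# Assumption: all three scripts sit under ./scripts/mipha/ in the checkout.
STAGES = [
    ("integrate base model", "./scripts/mipha/get_base_model.sh"),
    ("pretrain (feature alignment)", "./scripts/mipha/pretrain.sh"),
    ("visual instruction tuning", "./scripts/mipha/finetune.sh"),
]


def stage_commands(stages=STAGES):
    """Return the bash command (as an argv list) for each stage, in order."""
    return [["bash", script] for _, script in stages]


def run_pipeline(stages=STAGES, dry_run=True):
    """Print each stage and, unless dry_run is set, execute it.

    subprocess.run(check=True) raises CalledProcessError on a non-zero exit
    code, so later stages never start after a failed one.
    """
    executed = []
    for (name, _), cmd in zip(stages, stage_commands(stages)):
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_pipeline(dry_run=True)
```

Call `run_pipeline(dry_run=False)` once the base checkpoints and datasets are in place.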

Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding.

See Evaluation.md.
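
Greedy decoding helps reproducibility because it always picks the highest-scoring next token, so no random sampling is involved. The toy sketch below illustrates the idea with a hand-written bigram table; the table, the token names, and the `toy_scores` function are invented for illustration and have nothing to do with Mipha's actual evaluation code.

```python
# Hypothetical next-token scores, keyed by the previous token.
TOY_BIGRAMS = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"man": 0.7, "cat": 0.3},
    "man": {"irons": 0.9, "</s>": 0.1},
    "irons": {"</s>": 1.0},
}


def toy_scores(tokens):
    """Stand-in for a model: score candidates given the last token only."""
    return TOY_BIGRAMS[tokens[-1]]


def greedy_decode(next_token_scores, prompt, eos_token, max_new_tokens=16):
    """Append the argmax token at every step until EOS or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # no sampling, hence deterministic
        tokens.append(best)
        if best == eos_token:
            break
    return tokens


if __name__ == "__main__":
    print(greedy_decode(toy_scores, ["<s>"], "</s>"))
    # ['<s>', 'a', 'man', 'irons', '</s>'] on every run
```

In Hugging Face `transformers`, the equivalent behavior is typically obtained by calling `generate` with `do_sample=False` and a single beam; whether the evaluation scripts here do exactly that is described in Evaluation.md rather than in this section.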

CLI Inference Guide

You can chat about images using Mipha without the Gradio interface. Here is an example command:

python -m mipha.serve.cli \
    --model-path /path/to/mipha-3B \
    --image-file "mipha/serve/examples/extreme_ironing.jpg" \
    --conv-mode phi
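
If you want to try several images, it can be handy to generate the command line programmatically. The helper below only builds and prints the command shown above for each image; `build_cli_command` is a hypothetical name, and the sketch uses only the three flags that appear in this README. Because the CLI is described as a chat, the commands are printed rather than launched.

```python
import shlex


def build_cli_command(model_path, image_file, conv_mode="phi"):
    """Return the argv list for the CLI invocation shown in this README."""
    return [
        "python", "-m", "mipha.serve.cli",
        "--model-path", str(model_path),
        "--image-file", str(image_file),
        "--conv-mode", conv_mode,
    ]


if __name__ == "__main__":
    # Placeholder model path, as in the README example.
    images = ["mipha/serve/examples/extreme_ironing.jpg"]
    for image in images:
        cmd = build_cli_command("/path/to/mipha-3B", image)
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(cmd))
```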

Citation

If you find LLaVA-Phi or Mipha useful in your research or applications, please consider giving a star ⭐ and citing using the following BibTeX:

@misc{zhu2024llavaphi,
      title={LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model}, 
      author={Yichen Zhu and Minjie Zhu and Ning Liu and Zhicai Ou and Xiaofeng Mou and Jian Tang},
      year={2024},
      eprint={2401.02330},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{zhu2024comprehensive,
  title={A Comprehensive Overhaul of Multimodal Assistant with Small Language Models},
  author={Zhu, Minjie and Zhu, Yichen and Liu, Xin and Liu, Ning and Xu, Zhiyuan and Shen, Chaomin and Peng, Yaxin and Ou, Zhicai and Feng, Feifei and Tang, Jian},
  journal={arXiv preprint arXiv:2403.06199},
  year={2024}
}

Acknowledgement

We build our project based on

  • LLaVA: an amazing open-source project for vision-language assistants
  • LLaMA-Factory: We use this codebase to finetune SLMs
  • Safe-RLHF: We use this codebase to instruct-tune SLMs