- LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
- Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
- March 23rd, 2024: Our model 🔥🔥🔥 Mipha-3B and corresponding training codes are released.
- Jan. 26th, 2024: Now you can download our model weights.
- Jan. 15th, 2024: Our model and training codes are released.
- Jan. 5th, 2024: Our codes are currently undergoing an internal review and will be released shortly (expected next week).
Model | LLM | VQAv2 | GQA | SQA-I | VQA-T | POPE | MME-P | MMB |
---|---|---|---|---|---|---|---|---|
LLaVA-Phi-3B | Phi-2-2.7B | 71.4 | - | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 |
Mipha-1.6B | Phi-1.5-1.3B | 77.5 | 62.7 | 58.3 | 45.6 | 86.9 | 1203.1 | 57.7 |
Mipha-2.4B | Gemma-2B | 79.5 | 63.3 | 65.3 | 52.4 | 86.6 | 1397.1 | 59.4 |
Mipha-3B | Phi-2-2.7B | 81.3 | 63.9 | 70.9 | 56.6 | 86.7 | 1488.9 | 69.7 |
- Clone this repository and navigate to the Mipha folder
git clone https://github.com/zhuyiche/Mipha.git
cd Mipha
- Install Package
conda create -n mipha python=3.10 -y
conda activate mipha
pip install --upgrade pip # enable PEP 660 support
pip install -e .
Download Mipha-3B from Hugging Face.
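If you prefer to fetch the checkpoint programmatically, a minimal sketch with `huggingface_hub` is shown below; the repo id is a placeholder, so substitute the actual Mipha-3B repository linked above.

```python
from huggingface_hub import snapshot_download

# Minimal sketch: download the released checkpoint with huggingface_hub.
# NOTE: "<mipha-3b-repo-id>" is a placeholder; replace it with the actual
# Hugging Face repository linked above.
snapshot_download(
    repo_id="<mipha-3b-repo-id>",
    local_dir="./checkpoints/Mipha-3B",
)
```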
Mipha training consists of two stages: (1) feature alignment stage: use the LLaVA-1.5 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.
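For intuition, the sketch below shows a LLaVA-1.5-style two-layer MLP projector, which is what the feature alignment stage trains to map frozen vision-encoder features into the LLM's embedding space. It is a conceptual illustration, not the module defined in this repository, and the dimensions are assumptions (SigLIP-SO hidden size 1152, Phi-2 hidden size 2560).

```python
import torch.nn as nn

# Conceptual sketch of a LLaVA-1.5-style projector: stage 1 trains only this
# module while the vision encoder and the LLM stay frozen.
# The dimensions below are illustrative assumptions, not Mipha's actual config.
class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim) from the frozen encoder
        return self.proj(image_features)  # (batch, num_patches, llm_dim)
```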
The hyperparameters used in pretraining and finetuning are provided below.
- Pretraining
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
Mipha | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
Mipha | 128 | 2e-5 | 2 | 2048 | 0 |
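For reference, here is a hedged sketch of how the finetuning hyperparameters above might be expressed as HuggingFace TrainingArguments. The authoritative settings live in scripts/mipha/finetune.sh; the per-device batch size / GPU split, precision, scheduler, and warmup below are assumptions.

```python
from transformers import TrainingArguments

# Sketch only: the authoritative settings are in scripts/mipha/finetune.sh.
# The 8-GPU split (16 per device x 8 GPUs = 128 global batch) is an assumption.
# Max length (2048) is applied via tokenizer.model_max_length, not here.
training_args = TrainingArguments(
    output_dir="./checkpoints/mipha-finetune",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    num_train_epochs=2,
    weight_decay=0.0,
    bf16=True,                   # assumption: mixed-precision training
    lr_scheduler_type="cosine",  # assumption: LLaVA-style cosine schedule
    warmup_ratio=0.03,           # assumption: LLaVA-style warmup
)
```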
Our base model is Phi-2. You should download the weights from here, and change the --model_name_or_path in get_base_model.sh.
Our vision encoder is SigLIP-SO (0.4B). You should download the weights from here.
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here.
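The downloads above can also be scripted with `huggingface_hub`, as in the sketch below. The repo ids (microsoft/phi-2, google/siglip-so400m-patch14-384, liuhaotian/LLaVA-Pretrain) are our assumptions about the links referenced above; adjust them and the local paths to match your setup.

```python
from huggingface_hub import snapshot_download

# Sketch only: repo ids below are assumptions about the links referenced above.
snapshot_download(repo_id="microsoft/phi-2",
                  local_dir="./base_checkpoints/phi-2")
snapshot_download(repo_id="google/siglip-so400m-patch14-384",
                  local_dir="./base_checkpoints/siglip-so400m")
snapshot_download(repo_id="liuhaotian/LLaVA-Pretrain", repo_type="dataset",
                  local_dir="./data/llava-pretrain")
```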
Then, you should integrate phi-2 and SigLIP-SO into a single model by running the following script:
bash ./scripts/mipha/get_base_model.sh
bash ./scripts/mipha/pretrain.sh
Please refer here to prepare the instruction tuning data.
Training script with DeepSpeed ZeRO-3: finetune.sh.
bash ./scripts/mipha/finetune.sh
To ensure reproducibility, we evaluate the models with greedy decoding.
See Evaluation.md.
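In the HuggingFace generate API, greedy decoding simply means disabling sampling and beam search. The snippet below is a generic illustration of that setting, not the repo's evaluation harness; "gpt2" is a stand-in model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic illustration of greedy decoding (not the repo's evaluation code).
# "gpt2" is a stand-in model; the point is do_sample=False / num_beams=1.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Describe the image:", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=False,   # greedy: always take the highest-probability token
    num_beams=1,       # no beam search
    max_new_tokens=32,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```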
You can chat about images using Mipha without the Gradio interface. Here is an example command:
python -m mipha.serve.cli \
--model-path /path/to/mipha-3B \
--image-file "mipha/serve/examples/extreme_ironing.jpg" \
--conv-mode phi
If you find LLaVA-Phi or Mipha useful in your research or applications, please consider giving a star ⭐ and citing them with the following BibTeX:
@misc{zhu2024llavaphi,
title={LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model},
author={Yichen Zhu and Minjie Zhu and Ning Liu and Zhicai Ou and Xiaofeng Mou and Jian Tang},
year={2024},
eprint={2401.02330},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{zhu2024comprehensive,
title={A Comprehensive Overhaul of Multimodal Assistant with Small Language Models},
author={Zhu, Minjie and Zhu, Yichen and Liu, Xin and Liu, Ning and Xu, Zhiyuan and Shen, Chaomin and Peng, Yaxin and Ou, Zhicai and Feng, Feifei and Tang, Jian},
journal={arXiv preprint arXiv:2403.06199},
year={2024}
}
We build our project based on:
- LLaVA: an amazing open-source project for vision-language assistants
- LLaMA-Factory: We use this codebase to finetune SLMs
- Safe-RLHF: We use this codebase to instruct-tune SLMs