# LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model [[Paper](https://arxiv.org/abs/2401.02330)]
## Release

- [1/26] You can now download our model weights.
- [1/15] Our model and training code are released.
- [1/5] Our code is currently undergoing internal review and will be released shortly (expected next week).
## Install

- Clone this repository and navigate to the llava-phi folder:

```bash
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
```

- Install the package:

```bash
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
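As a quick sanity check, you can verify the editable install from the repository root. This is a minimal sketch that assumes the package installs under the name `llava_phi` (matching the `scripts/llava_phi/*` paths used later in this README):

```bash
# Assumption: the editable install exposes a `llava_phi` package,
# matching the scripts/llava_phi/* paths referenced below.
python -c "import llava_phi; print('llava_phi imported successfully')"
```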
## Model Weights

Download the model weights from Hugging Face.

The training curves can be found on wandb.
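If you prefer the command line, here is a minimal sketch using `huggingface-cli` (shipped with `huggingface_hub`). The repository id below is a placeholder, not the real one, so substitute the LLaVA-Phi weights repository linked above; the target directory matches the finetuned checkpoint path used by the scripts below:

```bash
# Placeholder repo id -- replace <org>/<llava-phi-weights> with the actual
# Hugging Face repository of the released LLaVA-Phi checkpoint.
huggingface-cli download <org>/<llava-phi-weights> --local-dir ./checkpoints/llavaPhi-v0-3b-finetune
```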
## Train

LLaVA-Phi training consists of two stages: (1) feature alignment: use the LLaVA-1.5 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.
We use a set of hyperparameters similar to LLaVA-1.5 in both the pretraining and finetuning phases. The hyperparameters used in each phase are provided below.
- Pretraining

| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-Phi | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning

| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| LLaVA-Phi | 128 | 2e-5 | 1 | 2048 | 0 |
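To see how these values are wired into the actual launch commands, you can grep the provided scripts. The flag names mentioned in the comment are the standard HuggingFace Trainer arguments; treating them as what the scripts use is an assumption about this repo's training entry point:

```bash
# Print the hyperparameter flags set in the provided launch scripts.
# Flag names such as --learning_rate, --num_train_epochs, --model_max_length and
# --weight_decay are the usual HuggingFace Trainer arguments; the global batch size
# is per_device_train_batch_size x num_gpus x gradient_accumulation_steps.
grep -E "learning_rate|epochs|max_length|weight_decay|batch_size|accumulation" \
    ./scripts/llava_phi/pretrain.sh ./scripts/llava_phi/finetune.sh
```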
### Pretrain (feature alignment)

Our base language model is phi-2. You should download the weights from here and change `--model_name_or_path` in `get_base_model.sh` accordingly.

Our vision encoder is CLIP ViT-L/14 336px. You should download the weights from here.
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here.
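If it is more convenient, these components can also be fetched from the Hugging Face Hub with `huggingface-cli`. The local directory for phi-2, the dataset directory, and the dataset repository id (`liuhaotian/LLaVA-Pretrain`, the repository LLaVA-1.5 uses for this 558K subset) are assumptions; the CLIP path matches the copy commands used below:

```bash
# Assumptions: phi-2 is placed under ./microsoft/phi-2 (point --model_name_or_path at it),
# and the 558K BLIP-captioned subset comes from liuhaotian/LLaVA-Pretrain, as in LLaVA-1.5.
# The CLIP directory matches the preprocessor_config.json copy commands below.
huggingface-cli download microsoft/phi-2 --local-dir ./microsoft/phi-2
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir ./openai/clip-vit-large-patch14-336
huggingface-cli download liuhaotian/LLaVA-Pretrain --repo-type dataset --local-dir ./data/llava-pretrain
```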
Then, integrate phi-2 and CLIP ViT-L/14 336px into a single base model by running the following script, and copy the CLIP preprocessor config into the resulting checkpoint:

```bash
bash ./scripts/llava_phi/get_base_model.sh
cp ./openai/clip-vit-large-patch14-336/preprocessor_config.json ./base_checkpoints_llava_phi
```

Launch feature-alignment pretraining, then copy the preprocessor config into the pretrained checkpoint:

```bash
bash ./scripts/llava_phi/pretrain.sh
cp ./openai/clip-vit-large-patch14-336/preprocessor_config.json ./checkpoints/llavaPhi-v0-3b-pretrain
```
### Visual Instruction Tuning

Please refer here to prepare the instruction tuning data.

The training script uses DeepSpeed ZeRO-3: `finetune.sh`.

```bash
bash ./scripts/llava_phi/finetune.sh
cp ./openai/clip-vit-large-patch14-336/preprocessor_config.json ./checkpoints/llavaPhi-v0-3b-finetune
```
## Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding.

See Evaluation.md for details.
## Citation

If you find LLaVA-Phi useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhu2024llavaphi,
      title={LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model},
      author={Yichen Zhu and Minjie Zhu and Ning Liu and Zhicai Ou and Xiaofeng Mou and Jian Tang},
      year={2024},
      eprint={2401.02330},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
## Acknowledgement

We build our project based on:
- LLaVA: an amazing open-source project for vision-language assistants
- LLaMA-Factory: we use this codebase to finetune the Phi model