A fork of the LLaVA codebase with added support for using the Baichuan2 models as the base LLM. This codebase is a dependency of the experiments in *See It from My Perspective*.
Please note that we have also made these models available on HuggingFace.
Models were trained using the v1.5 pretraining and LoRA fine-tuning scripts. We use the original LLaVA v1 fusion corpus (for Chinese, we use the translation shared by LinkSoul).
Each model directory is around 600MB unzipped, as it contains only 1) the projector and 2) the LoRA weights. The codebase will download the rest of the model (Llama2-7B or Baichuan2-7B-Chat, plus CLIP-L).
Base LLM | Fusion Corpus Language(s) | Download Link
---|---|---
Llama2-7B-Chat | en | link
Llama2-7B-Chat | zh | link
Llama2-7B-Chat | en/zh | link
Baichuan2-7B-Chat | en | link
Baichuan2-7B-Chat | zh | link
Baichuan2-7B-Chat | en/zh | link
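After downloading and unzipping one of the directories above, loading should look roughly like the sketch below. This is a minimal sketch, not a drop-in script: the local path `./llava-baichuan2-7b-en-lora` and the base-model identifier are placeholders, and the exact arguments should be checked against `predict.py`.

```python
# Minimal loading sketch (placeholder paths; see predict.py for the exact usage).
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

# Unzipped model directory containing the projector and LoRA weights (placeholder path).
lora_path = "./llava-baichuan2-7b-en-lora"

# The base LLM is downloaded and merged with the LoRA weights; the CLIP-L vision
# tower is pulled in from the model's config.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=lora_path,
    model_base="baichuan-inc/Baichuan2-7B-Chat",  # or the Llama2 base for the Llama2 models
    model_name=get_model_name_from_path(lora_path),
)
```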
If you use any of these models, please cite:
- Visual Instruction Tuning, Liu et al., 2024
- See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding, Ananthram et al., 2024
If you use the Chinese fusion corpus, please also acknowledge the LinkSoul translation: https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions.
For a minimal example of using one of our models for inference, please see `predict.py`.
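For reference only, a rough sketch of a single inference pass with a loaded model (continuing from the loading snippet above) is given below. The conversation template name, image path, and generation settings are assumptions; `predict.py` remains the authoritative example, in particular for the Baichuan2 and Chinese models.

```python
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Build a single-turn prompt containing the image placeholder token.
# "llava_llama_2" is an assumed template name; predict.py shows the template this fork expects.
conv = conv_templates["llava_llama_2"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image with the CLIP image processor returned by load_pretrained_model.
image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# Tokenize the prompt, splicing in the special image token index.
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=128,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
```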