mLLaVA

A fork of the LLaVA codebase, adapted to train multilingual LoRA-based LLaVA variants, with added support for using the Baichuan2 models as the base LLM. This codebase is a dependency of the experiments in See It From My Perspective.

Please note that we have also made these models available on HuggingFace.
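For example, a checkpoint can be fetched programmatically with `huggingface_hub`. This is a minimal sketch; the repository id below is a placeholder, so substitute the id of the checkpoint you want from our HuggingFace page.

```python
# Minimal sketch: download one of the mLLaVA checkpoints from the
# HuggingFace Hub. The repo_id is a placeholder -- substitute the id of
# the checkpoint you want from our HuggingFace page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/mllava-llama2-7b-chat-en-lora",   # placeholder id
    local_dir="checkpoints/mllava-llama2-7b-chat-en-lora",
)
print(f"Checkpoint downloaded to {local_dir}")
```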

Models

Models were trained using the v1.5 pretraining and LoRA fine-tuning scripts. We use the original LLaVA v1 fusion corpus (for Chinese, we use the translation shared by LinkSoul).

Each model directory is around 600 MB unzipped, as it contains only 1) the projector and 2) the LoRA weights. The codebase will download the rest of the model (Llama2-7B or Baichuan2-7B-Chat and CLIP-L); a minimal loading sketch follows the table below.

| Base LLM | Fusion Corpus Language(s) | Download Link |
| --- | --- | --- |
| Llama2-7B-Chat | en | link |
| Llama2-7B-Chat | zh | link |
| Llama2-7B-Chat | en/zh | link |
| Baichuan2-7B-Chat | en | link |
| Baichuan2-7B-Chat | zh | link |
| Baichuan2-7B-Chat | en/zh | link |
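
As noted above, each checkpoint holds only the projector and LoRA weights, so the loader has to attach them to the matching base LLM and vision tower. The sketch below shows what that looks like with the upstream LLaVA loading API; the checkpoint path and base-model id are placeholders.

```python
# Minimal sketch: load a downloaded LoRA checkpoint on top of its base LLM,
# using the upstream LLaVA loading API. Both paths are placeholders; for a
# Baichuan2-based checkpoint, point model_base at Baichuan2-7B-Chat instead.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "checkpoints/mllava-llama2-7b-chat-en-lora"  # projector + LoRA weights
model_base = "meta-llama/Llama-2-7b-chat-hf"              # placeholder base LLM

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=model_base,  # required for LoRA checkpoints
    model_name=get_model_name_from_path(model_path),
)
```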

If you use any of these models, please cite:

  1. Visual Instruction Tuning, Liu et al., 2024
  2. See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding, Ananthram et al., 2024

Please also cite the LinkSoul Chinese translation of the fusion corpus: https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions.

Example Usage

For a minimal example of using one of our models for inference, please see predict.py.
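
The sketch below only illustrates the general shape of such a call, following the upstream LLaVA evaluation helper; predict.py remains the authoritative example, and the paths, prompt, and image file here are placeholders.

```python
# Minimal sketch: single-image inference with a LoRA checkpoint, following
# the upstream LLaVA eval helper. predict.py is the authoritative example;
# all paths, the prompt, and the image file below are placeholders.
from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path

model_path = "checkpoints/mllava-llama2-7b-chat-en-lora"

args = type("Args", (), {
    "model_path": model_path,
    "model_base": "meta-llama/Llama-2-7b-chat-hf",  # placeholder base LLM
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image.",
    "conv_mode": None,
    "image_file": "images/example.jpg",
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```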