A fork of the LLaVA codebase with added support for using the Baichuan2 models as the base LLM. This codebase is a dependency of the experiments in *See It from My Perspective*.
Please note that we have also made these models available on HuggingFace.
Models were trained using the v1.5 pretraining and LoRA fine-tuning scripts. We use the original LLaVA v1 fusion corpus (for Chinese, we use the translation shared by LinkSoul).
Each model directory is around 600MB unzipped, as it contains only 1) the projector and 2) the LoRA weights. The codebase will download the rest of the model (Llama2-7B or Baichuan2-7B-Chat, plus CLIP-L).
Base LLM | Fusion Corpus Language(s) | Download Link
---|---|---
Llama2-7B-Chat | en | link
Llama2-7B-Chat | zh | link
Llama2-7B-Chat | en/zh | link
Baichuan2-7B-Chat | en | link
Baichuan2-7B-Chat | zh | link
Baichuan2-7B-Chat | en/zh | link
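After downloading and unzipping one of the directories above, loading should look roughly like the sketch below. This is a minimal sketch, not a drop-in script: the local path `./llava-baichuan2-7b-en-lora` and the base-model identifier are placeholders, and the exact arguments should be checked against `predict.py`.

```python
# Minimal loading sketch (placeholder paths; see predict.py for the exact usage).
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

# Unzipped model directory containing the projector and LoRA weights (placeholder path).
lora_path = "./llava-baichuan2-7b-en-lora"

# The base LLM is downloaded and merged with the LoRA weights; the CLIP-L vision
# tower is pulled in from the model's config.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=lora_path,
    model_base="baichuan-inc/Baichuan2-7B-Chat",  # or the Llama2 base for the Llama2 models
    model_name=get_model_name_from_path(lora_path),
)
```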
If you use any of these models, please cite:
- Visual Instruction Tuning, Liu et al., 2024
- See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding, Ananthram et al., 2024
If you use the Chinese fusion corpus, please also acknowledge the LinkSoul translation: https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions.
For a minimal example of using one of our models for inference, please see `predict.py`.
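For reference only, a rough sketch of a single inference pass with a loaded model (continuing from the loading snippet above) is given below. The conversation template name, image path, and generation settings are assumptions; `predict.py` remains the authoritative example, in particular for the Baichuan2 and Chinese models.

```python
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Build a single-turn prompt containing the image placeholder token.
# "llava_llama_2" is an assumed template name; predict.py shows the template this fork expects.
conv = conv_templates["llava_llama_2"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image with the CLIP image processor returned by load_pretrained_model.
image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# Tokenize the prompt, splicing in the special image token index.
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=128,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
```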