llama3v

llama3v is a SOTA vision model that is powered by Llama3 8B and siglip-so400m.

[ GitHub ] [ Model Weights ] [ Blog Post ]

Features

SOTA open-source VLLM
Model is available on Hugging Face
Fast local inference
Inference code is released (training code is coming soon, once it is cleaned up)

Check out Hugging Face for the model weights.

Usage

You can use llama3v with the Transformers library.

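from transformers import AutoTokenizer, AutoModel
from PIL import Image

# Load the model and tokenizer from the Hugging Face Hub; .cuda() requires a CUDA-capable GPU.
model = AutoModel.from_pretrained("mustafaaljadery/llama3v").cuda()
tokenizer = AutoTokenizer.from_pretrained("mustafaaljadery/llama3v")

# Load the input image from disk.
image = Image.open("test_image.png")

# Ask a question about the image; a low temperature keeps sampling close to deterministic.
answer = model.generate(image=image, message="What is this image?", temperature=0.1, tokenizer=tokenizer)

print(answer)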

The model first passes the image through the vision model to extract features, then passes those features through the language model to generate the answer.

[Figure: sample inference pipeline]
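The flow described above can be sketched with toy stand-in modules. This is a minimal illustration, not the released inference code: `ToyVisionEncoder`, `ToyLanguageModel`, the tiny dimensions, and the choice to place image embeddings before the prompt embeddings are all assumptions made so the example runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes so the sketch runs on CPU; the real models are far larger.
VISION_DIM, LLM_DIM, VOCAB = 32, 64, 100


class ToyVisionEncoder(nn.Module):
    """Hypothetical stand-in for siglip-so400m: patches -> feature vectors."""

    def __init__(self, patch_pixels: int):
        super().__init__()
        self.proj = nn.Linear(patch_pixels, VISION_DIM)

    def forward(self, patches):
        return self.proj(patches)  # (batch, num_patches, VISION_DIM)


class ToyLanguageModel(nn.Module):
    """Hypothetical stand-in for Llama3 8B: embeddings -> token logits.

    A real LLM would run its transformer layers between embed and head.
    """

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, LLM_DIM)
        self.head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, inputs_embeds):
        return self.head(inputs_embeds)  # (batch, seq_len, VOCAB)


vision = ToyVisionEncoder(patch_pixels=48)
projector = nn.Linear(VISION_DIM, LLM_DIM)  # maps image features into LLM space
llm = ToyLanguageModel()

patches = torch.randn(1, 16, 48)              # 16 fake image patches
prompt_ids = torch.randint(0, VOCAB, (1, 5))  # 5 fake prompt tokens

with torch.no_grad():
    image_features = vision(patches)          # step 1: extract image features
    image_embeds = projector(image_features)  # step 2: project to LLM space
    text_embeds = llm.embed(prompt_ids)       # step 3: embed the text prompt
    # Assumption: image embeddings are placed before the prompt embeddings.
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    logits = llm(inputs_embeds)               # step 4: run the language model
    next_token = logits[:, -1].argmax(dim=-1)  # greedy pick of the next token

print(inputs_embeds.shape, logits.shape, next_token.shape)
```

In a real pipeline the last step would repeat in a loop, appending each generated token until an end-of-sequence token is produced.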

Training Process

In our training process, we combine the siglip-so400m vision model with the Llama3 8B language model, so that the combined model accepts image and text input and generates text.

We add a projection layer on top of the siglip-so400m model that maps the image features into the Llama3 embedding space, so that the language model can consume them alongside text embeddings.
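To make the shapes concrete, here is a minimal sketch of such a projection. It assumes a single linear layer, which the text above does not specify (the actual projector may be an MLP or something else). The two dimensions are the hidden sizes of the public siglip-so400m and Llama 3 8B checkpoints.

```python
import torch
import torch.nn as nn

SIGLIP_DIM = 1152  # hidden size of the public siglip-so400m vision encoder
LLAMA3_DIM = 4096  # hidden size of the public Llama 3 8B model

# Assumption: one linear layer; the actual projector architecture may differ.
projector = nn.Linear(SIGLIP_DIM, LLAMA3_DIM)

# Four fake patch features standing in for the vision encoder output.
image_features = torch.randn(1, 4, SIGLIP_DIM)

with torch.no_grad():
    image_embeds = projector(image_features)  # now in the LLM embedding space

num_params = sum(p.numel() for p in projector.parameters())
print(image_embeds.shape, num_params)
```

Each projected vector has the same width as a Llama3 token embedding, which is what allows it to be concatenated with the text embeddings.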

In the pretraining stage, we freeze all weights other than the projection layer. We train on about 600K images.
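A minimal sketch of this freezing scheme in PyTorch follows. The three `nn.Linear` modules are hypothetical stand-ins for the vision encoder, projection layer, and language model, and the optimizer, learning rate, and loss are illustrative assumptions rather than the settings actually used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the three components.
model = nn.ModuleDict({
    "vision": nn.Linear(48, 32),
    "projector": nn.Linear(32, 64),
    "llm": nn.Linear(64, 100),
})

# Pretraining: freeze everything, then unfreeze only the projection layer.
model.requires_grad_(False)
model["projector"].requires_grad_(True)

trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]

# Only trainable parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# One illustrative training step on random data.
x = torch.randn(4, 48)
targets = torch.randint(0, 100, (4,))
llm_before = model["llm"].weight.clone()
proj_before = model["projector"].weight.clone()

logits = model["llm"](model["projector"](model["vision"](x)))
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()
optimizer.step()

print(trainable_names)
```

Gradients still flow through the frozen language model stand-in to reach the projector, but only the projector's weights change after the step.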

In the fine-tuning stage, we update the weights of the Llama3 8B model while freezing the weights of the siglip-so400m model and the projection layer. We train on approximately 1M images. In addition, we generate synthetic multimodal data with models from the Yi model family and fine-tune our model on this synthetic data.
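The two stages differ only in which component is trainable, which can be captured in a small lookup table. The sketch below is an illustration under assumptions: `STAGES`, `configure_stage`, and the toy `nn.Linear` components are hypothetical names and stand-ins, not part of the released code.

```python
import torch.nn as nn

# Which components are trainable in each stage, per the description above.
STAGES = {
    "pretrain": {"projector"},
    "finetune": {"llm"},
}


def configure_stage(model: nn.ModuleDict, stage: str) -> int:
    """Freeze or unfreeze each component; return the trainable parameter count."""
    trainable = STAGES[stage]
    for name, component in model.items():
        component.requires_grad_(name in trainable)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Hypothetical toy stand-ins for the three components.
model = nn.ModuleDict({
    "vision": nn.Linear(48, 32),
    "projector": nn.Linear(32, 64),
    "llm": nn.Linear(64, 100),
})

pretrain_params = configure_stage(model, "pretrain")
finetune_params = configure_stage(model, "finetune")
print(pretrain_params, finetune_params)
```

With the real models the gap between the two counts is much larger, since the language model has billions of parameters while a projection layer has only millions.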

Read more about our training process here.

Acknowledgements

Citations