This repo provides the scripts and instructions to build a custom VLM using the Prismatic VLM repository. The model details are as follows:
- Vision Encoder - DINOv2 + SigLIP @ 384px resolution. Why 2 vision encoders?
- Connector - MLP (DINOv2 and SigLIP features are concatenated and then projected into the Phi-3 representation space; see the sketch below)
- Language Model - Phi-3 + LoRA
- Pre-train (Align) Dataset - LLaVA-CC3M-Pretrain-595K
- Fine-tune (Instruction) Dataset - LLaVA-v1.5-Instruct + LRV-Instruct
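To make the connector concrete, here is a minimal sketch of the concatenate-then-project step. The dimensions are illustrative assumptions (DINOv2 ViT-L and SigLIP ViT-SO400M feature widths, Phi-3-mini hidden size), not necessarily the repo's exact module:

```python
import torch
import torch.nn as nn

class FusedMLPProjector(nn.Module):
    """Concatenate DINOv2 + SigLIP patch features, project to the LLM dim."""

    def __init__(self, dino_dim: int = 1024, siglip_dim: int = 1152, llm_dim: int = 3072):
        super().__init__()
        fused_dim = dino_dim + siglip_dim
        self.mlp = nn.Sequential(
            nn.Linear(fused_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_feats: torch.Tensor, siglip_feats: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, num_patches, dim); fuse along the channel axis.
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)
        return self.mlp(fused)  # (batch, num_patches, llm_dim)
```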
| Dataset | VQAv2 | POPE | AI2D | TextVQA |
|---|---|---|---|---|
| Accuracy (%) | 63.3 | 86.3 | 58.9 | 46.8 |
The weights for the "align" and "finetune" stages are available at nms05/Dinov2-SigLIP-Phi3-LoRA.
Clone this repo and follow the installation instructions here. Additionally, run the following:
```bash
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/peft
```
```bash
python scripts/preprocess.py --dataset_id "llava-laion-cc-sbu-558k" --root_dir training_data/
python scripts/preprocess.py --dataset_id "llava-v1.5-instruct" --root_dir training_data/
```
Instructions and scripts for downloading the LRV-Instruct datasets can be found in scripts/additional-datasets.
1. LLM and LoRA Config: The microsoft/Phi-3-mini-4k-instruct model from HuggingFace is added in prismatic/models/backbones/llm/phi3.py. The LoRA configuration is also specified here.
2. Instruction Template: Phi-3 is instruction tuned and follows a specific prompt template, implemented in prismatic/models/backbones/llm/prompting/phi3_chat_prompter.py (see the template sketch after this list).
3. LoRA: Using the LoRA configuration from step 1, the LoRA layers are added to the base LLM (Phi-3) with the HuggingFace PEFT library in prismatic/models/backbones/llm/base_llm.py (see the PEFT sketch after this list).
4. Freeze LLM Params: The get_peft_model() function freezes the LLM layers and finetunes only the LoRA params. Make sure to comment out line 153 in prismatic/models/vlms/prismatic.py, which otherwise finetunes the entire LLM.
5. Update Entries: Update prismatic/models/backbones/llm/__init__.py with the new LLM.
6. Update Entries: Update the LLM_BACKBONES registry in prismatic/models/materialize.py.
7. Update Entries: Finally, add a new entry for your entire VLM in prismatic/conf/models.py. This is also where you specify the Vision Backbone, Connector type (linear or MLP), and the image resizing strategy.
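For step 2, Phi-3-instruct wraps chat turns in special tokens. A minimal sketch of the template the prompter has to produce (exact system-prompt handling is the repo's choice; this helper is hypothetical):

```python
def build_phi3_prompt(user_message: str, assistant_reply: str = "") -> str:
    """Wrap a single turn in Phi-3's chat template.

    Phi-3-instruct expects <|user|> ... <|end|> <|assistant|> markers;
    generation stops when the model emits <|end|>.
    """
    prompt = f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"
    if assistant_reply:  # append the completed turn when building training targets
        prompt += f"{assistant_reply}<|end|>\n"
    return prompt
```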
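For step 3, the PEFT wrapping boils down to something like the following. The rank, alpha, and target module names here are illustrative assumptions; the repo's actual values live in phi3.py:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative config; the repo's actual rank/alpha/targets are set in phi3.py.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # assumed Phi-3 attention projections
    task_type="CAUSAL_LM",
)

llm = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
llm = get_peft_model(llm, lora_config)  # freezes base weights, adds LoRA adapters
llm.print_trainable_parameters()        # sanity check: only LoRA params train
```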
The entry point for training models is scripts/pretrain.py. Specify the desired model config, dataset config, stage (align or fine-tune), etc.

Note: set enable_peft = False in prismatic/models/backbones/llm/phi3.py (line 63) for "align" stage training.
```bash
# Run from the root of the repository
torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain.py
```
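Prismatic's pretrain.py parses draccus-style command-line overrides, so the model and stage are typically selected on the command line. The values below are placeholders for the entries you registered above; check the config dataclasses for the exact flag names:

```bash
# Placeholder values; use the model id you registered in prismatic/conf/models.py
torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain.py \
  --model.type "dino-siglip-phi3-lora-model" \
  --stage "align"
```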
- The weights for the "align" stage (trains the MLP connector) and the "finetune" stage (requires the MLP weights and trains MLP+LoRA) are available at nms05/Dinov2-SigLIP-Phi3-LoRA.
- Download them to runs/ (one way to do this is sketched below).
- For training hyperparameters, refer to the config files.
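One way to fetch the checkpoints is with huggingface_hub; the exact directory layout under runs/ is an assumption, so match whatever your config files expect:

```python
from huggingface_hub import snapshot_download

# Pulls both the "align" and "finetune" checkpoints into runs/
snapshot_download(
    repo_id="nms05/Dinov2-SigLIP-Phi3-LoRA",
    local_dir="runs",
)
```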
The model is evaluated using the official prismatic-eval repository.
```
Results for Model `dino-siglip-phi3-lora-model` on vqa-v2-slim
          => Accuracy (Official): 63.300
=====================================
[*] Results for Model `dino-siglip-phi3-lora-model` on ai2d-slim (val/test/final)
          => AI2D-val Accuracy (Official): 0.597
          => AI2D-val ROC AUC  (Official): 0.854
          => AI2D-val PR AUC   (Official): 0.726
          => AI2D-test Accuracy (Official): 0.582
          => AI2D-test ROC AUC  (Official): 0.848
          => AI2D-test PR AUC   (Official): 0.717
          => AI2D-final Accuracy (Official): 0.589
          => AI2D-final ROC AUC  (Official): 0.851
          => AI2D-final PR AUC   (Official): 0.722
=====================================
Results for Model `dino-siglip-phi3-lora-model` on text-vqa-slim
          => TextVQA-OCR Accuracy  (Official): 0.468
          => TextVQA-Pure Accuracy (Official): 0.341
=====================================
Results for Model `dino-siglip-phi3-lora-model` on pope-slim (adversarial/popular/random)
          => POPE-adversarial Accuracy (Official): 0.854
          => POPE-adversarial ROC AUC  (Official): 0.930
          => POPE-adversarial PR AUC   (Official): 0.937
          => POPE-popular Accuracy (Official): 0.870
          => POPE-popular ROC AUC  (Official): 0.950
          => POPE-popular PR AUC   (Official): 0.949
          => POPE-random Accuracy (Official): 0.864
          => POPE-random ROC AUC  (Official): 0.964
          => POPE-random PR AUC   (Official): 0.969
          => POPE-final Accuracy (Official): 0.863
          => POPE-final ROC AUC  (Official): 0.947
          => POPE-final PR AUC   (Official): 0.950
```
Run scripts/generate.py to chat with the model via the terminal.
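For example (the --model_path flag follows the upstream Prismatic generate script; the checkpoint path is a placeholder):

```bash
python scripts/generate.py --model_path runs/<finetune-checkpoint-dir>
```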
Instruction: "Provide a detailed description of the given image."
Response:
The image features a dining table with a white plate containing a breakfast meal. The plate is filled with various food items, including eggs, toast, and orange slices.
There are also a couple of sandwiches on the plate. In addition to the plate, there are several cups and a bottle placed on the table. A knife and a fork can be seen near the plate, ready for use.
The table is surrounded by multiple chairs, with some people sitting on them, enjoying their meal. The scene appears to be a casual dining setting, with people gathered around the table to share a meal together.
```bibtex
@article{karamcheti2024prismatic,
    title   = {Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models},
    author  = {Siddharth Karamcheti and Suraj Nair and Ashwin Balakrishna and Percy Liang and Thomas Kollar and Dorsa Sadigh},
    journal = {arXiv preprint arXiv:2402.07865},
    year    = {2024},
}
```