This repository contains a script for training Llama3.2-Vision using only HuggingFace.
Other projects you may like:
- [Phi3-Vision Finetuning]
- [Qwen2-VL Finetuning]
This repository supports:
- DeepSpeed
- LoRA and QLoRA
- Full fine-tuning
- Multi-image and video training
Install the required packages using environment.yaml.
conda env create -f environment.yaml
conda activate llama
Note: Llama3.2-Vision does not support flash-attention2 for now.
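Since FlashAttention-2 is not supported, the model should be loaded with the default SDPA attention. The snippet below is only a minimal loading sanity check, not part of the training script; the model id is an example and should be replaced with your own checkpoint.

```python
# Minimal sanity check that the model loads without FlashAttention-2.
# The model id below is only an example; replace it with your checkpoint.
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # flash_attention_2 is not supported for this model
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```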
The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided `--image_folder` (a small validation sketch follows the single-image example below).

When using a multi-image dataset, the image tokens should all be `<image>`, and the image file names should be in a list (see the check after the multi-image example below). Please see the examples below and format your data accordingly.
Example for single-image dataset
[
{
"id": "000000033471",
"image": "000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
{
"from": "human",
"value": "What feature can be seen on the back of the bus?"
},
{
"from": "gpt",
"value": "The back of the bus features an advertisement."
},
{
"from": "human",
"value": "Is the bus driving down the street or pulled off to the side?"
},
{
"from": "gpt",
"value": "The bus is driving down the street, which is crowded with people and other vehicles."
}
]
}
...
]
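For reference, here is a small validation sketch that checks every referenced image against the folder you will pass as `--image_folder`. The file names `data.json` and `images/` are placeholders.

```python
# Sketch: verify that every image referenced in the LLaVA-format JSON
# exists under the folder you will pass as --image_folder.
# "data.json" and "images" are placeholder paths.
import json
import os

data_path = "data.json"
image_folder = "images"

with open(data_path, "r") as f:
    data = json.load(f)

missing = []
for entry in data:
    images = entry.get("image", [])
    if isinstance(images, str):   # single-image entries store a plain string
        images = [images]
    for name in images:
        if not os.path.exists(os.path.join(image_folder, name)):
            missing.append(name)

print(f"{len(missing)} missing image files", missing[:10])
```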
Example for multi-image dataset
[
{
"id": "000000033471",
"image": ["000000033471.jpg", "000000033472.jpg"],
"conversations": [
{
"from": "human",
"value": "<image>\n<image>\nIs the perspective of the camera differnt?"
},
{
"from": "gpt",
"value": "Yes, It the perspective of the camera is different."
}
]
}
...
]
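As a sanity check for multi-image entries, the number of `<image>` tokens in the human turns should equal the number of file names in the `image` list. A small sketch, assuming the JSON layout shown above and a placeholder `data.json` path:

```python
# Sketch: for multi-image entries, the number of <image> tokens in the
# conversation should equal the number of image file names in the list.
import json

with open("data.json", "r") as f:   # placeholder path
    data = json.load(f)

for entry in data:
    images = entry.get("image", [])
    if not isinstance(images, list):
        continue  # single-image entry
    n_tokens = sum(turn["value"].count("<image>")
                   for turn in entry["conversations"] if turn["from"] == "human")
    if n_tokens != len(images):
        print(f"id={entry['id']}: {n_tokens} <image> tokens vs {len(images)} images")
```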
Example for video dataset
[
{
"id": "sample1",
"video": "sample1.mp4",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is going on in this video?"
},
{
"from": "gpt",
"value": "A man is walking down the road."
}
]
}
...
]
Note: Llama3.2-Vision processes a video as a sequence of images.
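Because a video is handled as a sequence of images, one way to prepare video data is to sample a fixed number of frames and reference them as a multi-image entry. The sketch below uses OpenCV; the paths and the number of frames are arbitrary choices for illustration, not what the training script does internally.

```python
# Sketch: uniformly sample frames from a video so it can be used as a
# multi-image sample. Paths and the number of frames are arbitrary examples.
import cv2
import os

video_path = "sample1.mp4"
out_dir = "frames/sample1"
num_frames = 8

os.makedirs(out_dir, exist_ok=True)
cap = cv2.VideoCapture(video_path)
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]

saved = []
for idx in indices:
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ok, frame = cap.read()
    if ok:
        name = os.path.join(out_dir, f"frame_{idx:05d}.jpg")
        cv2.imwrite(name, frame)
        saved.append(name)
cap.release()
print(saved)
```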
Important: Only batch_size 1 is supported for now. batch_size > 1 still needs to be tested; please open an issue after testing it.
To run the training script, use the following command:
bash scripts/finetune.sh
If you want to train only the language model with LoRA and perform full training for the vision model:
bash scripts/finetune_lora.sh
If you want to train both the language model and the vision model with LoRA:
bash scripts/finetune_lora_vision.sh
IMPORTANT: If you want to tune the `embed_token` with LoRA, you need to tune `lm_head` together.
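For illustration, a PEFT `LoraConfig` that trains the embedding together with `lm_head` might look like the sketch below. The module names are typical for Llama-style models and are assumptions; they may differ from what the training script resolves internally.

```python
# Sketch: if the embedding layer is trained alongside LoRA, lm_head must be
# trained as well. Module names here are illustrative for Llama-style models.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # tune these two together
    task_type="CAUSAL_LM",
)
```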
Training arguments
- `--deepspeed` (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
- `--data_path` (str): Path to the LLaVA formatted training data (a JSON file). (Required)
- `--image_folder` (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
- `--model_id` (str): Path to the Llama3.2-Vision model. (Required)
- `--output_dir` (str): Output directory for model checkpoints.
- `--num_train_epochs` (int): Number of training epochs (default: 1).
- `--per_device_train_batch_size` (int): Training batch size per GPU per forwarding step.
- `--gradient_accumulation_steps` (int): Gradient accumulation steps (default: 4).
- `--freeze_vision_tower` (bool): Option to freeze vision_model (default: False).
- `--tune_merger` (bool): Option to tune projector (default: True).
- `--num_lora_modules` (int): Number of target modules to add LoRA (-1 means all layers).
- `--vision_lr` (float): Learning rate for vision_model.
- `--projector_lr` (float): Learning rate for projector.
- `--learning_rate` (float): Learning rate for language module.
- `--bf16` (bool): Option for using bfloat16.
- `--fp16` (bool): Option for using fp16.
- `--lora_namespan_exclude` (str): Exclude modules with namespans to add LoRA.
- `--max_seq_length` (int): Maximum sequence length (default: 128K).
- `--bits` (int): Quantization bits (default: 16).
- `--disable_flash_attn2` (bool): Disable Flash Attention 2.
- `--report_to` (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
- `--logging_dir` (str): Logging directory (default: "./tf-logs").
- `--lora_rank` (int): LoRA rank (default: 128).
- `--lora_alpha` (int): LoRA alpha (default: 256).
- `--lora_dropout` (float): LoRA dropout (default: 0.05).
- `--logging_steps` (int): Logging steps (default: 1).
- `--dataloader_num_workers` (int): Number of data loader workers (default: 4).
Note: The learning rate of `vision_model` should be 5x ~ 10x smaller than that of the `language_model`.
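One common way to apply these separate learning rates is through optimizer parameter groups. The sketch below is illustrative only, not the repository's exact implementation; the name prefixes used to split the parameters ("vision_model", "multi_modal_projector") and the learning-rate values are assumptions.

```python
# Sketch: separate learning rates for the vision tower, the projector, and the
# language model via optimizer parameter groups. The name prefixes used to
# split parameters are assumptions about the model's module names.
import torch

def build_optimizer(model, lr=1e-5, vision_lr=2e-6, projector_lr=1e-5):
    vision, projector, language = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "vision_model" in name:
            vision.append(param)
        elif "multi_modal_projector" in name:
            projector.append(param)
        else:
            language.append(param)
    return torch.optim.AdamW([
        {"params": language, "lr": lr},
        {"params": vision, "lr": vision_lr},        # 5x ~ 10x smaller than lr
        {"params": projector, "lr": projector_lr},
    ])
```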
You can train the model using a video dataset. However, Llama3.2-Vision processes videos as a sequence of images, so you'll need to select specific frames and treat them as multiple images for training. You can also set LoRA configs to train the video model with LoRA.
bash scripts/finetune_video.sh
If you run out of VRAM, you can use zero3_offload instead of zero3. However, using zero3 is preferred.
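The essential difference between the two is that zero3_offload moves optimizer and parameter state to CPU memory. Shown below as a Python dict purely for illustration; the repository ships its DeepSpeed configs as JSON files under scripts/, and the exact contents there may differ.

```python
# Sketch: the core of a ZeRO-3 CPU-offload DeepSpeed config, written as a
# Python dict for illustration (the repository keeps these as JSON files).
zero3_offload = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```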
To merge the LoRA weights back into the base model, run:
bash scripts/merge_lora.sh
Note: Remember to replace the paths in finetune.sh or finetune_lora.sh with your specific paths. (Also in merge_lora.sh when using LoRA.)
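Conceptually, the merge step does something like the following with PEFT. The paths and model id are placeholders; use the provided merge_lora.sh for the actual workflow.

```python
# Sketch: merge trained LoRA adapters back into the base model with PEFT.
# Paths and model id are placeholders; merge_lora.sh handles the real workflow.
from peft import PeftModel
from transformers import MllamaForConditionalGeneration, AutoProcessor

base_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"   # example base model
adapter_dir = "output/lora_checkpoint"                  # placeholder path

base = MllamaForConditionalGeneration.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()                       # bake LoRA into the weights

merged.save_pretrained("output/merged_model")
AutoProcessor.from_pretrained(base_id).save_pretrained("output/merged_model")
```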
If you see the following error during training:
Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
you can run `unset LD_LIBRARY_PATH` to work around it. See this issue for more details.
TODO:
- Support for multi-image & video data
- Support for batch_size > 1
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
If you find this repository useful in your project, please consider giving a ⭐ and citing:
@misc{Llama3.2-Vision-Finetuning,
author = {Yuwon Lee},
title = {Llama3.2-Vision-Finetune},
year = {2024},
publisher = {GitHub},
url = {https://github.com/2U1/Llama3.2-Vision-Ft}
}
This project is based on
- LLaVA-NeXT: An amazing open-source project for large multimodal models.
- Llama3.2-Vision: Awesome pretrained MLLM by Meta.