/Phi3V-Finetuning

Parameter-efficient finetuning script for Phi-3-vision, the strong multimodal language model by Microsoft.

Primary LanguagePythonApache License 2.0Apache-2.0

English | 简体中文

Training Phi3-V with PEFT

This repository contains a script for training the Phi3-V model with Parameter-Efficient Fine-Tuning (PEFT) techniques using various configurations and options.

Table of Contents

Supported Features

  • Training on the mixture of NLP data and vision-language data
  • Flexible selection of LoRA target modules
  • Deepspeed Zero-2
  • Deepspeed Zero-3
  • PyTorch FSDP
  • Gradient checkpointing (only compatible with ZeRO-3 for now)
  • QLoRA
  • Disable/enable Flash Attention 2

Installation

Install the required packages using either requirements.txt or environment.yml.

Using requirements.txt

pip install -r requirements.txt

Using environment.yml

conda env create -f environment.yml
conda activate phi3v

Model Download

Before training, download the Phi3-V model from HuggingFace. It is recommended to use the huggingface-cli to do this.

  1. Install the HuggingFace CLI:
pip install -U "huggingface_hub[cli]"
  1. Download the model:
huggingface-cli download microsoft/Phi-3-vision-128k-instruct --local-dir Phi-3-vision-128k-instruct --resume-download

Usage

To run the training script, use the following command:

bash scripts/train.sh

Note: Remember to replace the paths in train.sh with your specific paths.

Arguments

  • --data_path (str): Path to the LLaVA formatted training data (a JSON file). (Required)
  • --image_folder (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
  • --model_id (str): Path to the Phi3-V model. (Required)
  • --proxy (str): Proxy settings (default: None).
  • --output_dir (str): Output directory for model checkpoints (default: "output/test_train").
  • --num_train_epochs (int): Number of training epochs (default: 1).
  • --per_device_train_batch_size (int): Training batch size per GPU per forwarding step.
  • --gradient_accumulation_steps (int): Gradient accumulation steps (default: 4).
  • --deepspeed_config (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
  • --num_lora_modules (int): Number of target modules to add LoRA (-1 means all layers).
  • --lora_namespan_exclude (str): Exclude modules with namespans to add LoRA.
  • --max_seq_length (int): Maximum sequence length (default: 3072).
  • --quantization (flag): Enable quantization.
  • --disable_flash_attn2 (flag): Disable Flash Attention 2.
  • --report_to (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
  • --logging_dir (str): Logging directory (default: "./tf-logs").
  • --lora_rank (int): LoRA rank (default: 128).
  • --lora_alpha (int): LoRA alpha (default: 256).
  • --lora_dropout (float): LoRA dropout (default: 0.05).
  • --logging_steps (int): Logging steps (default: 1).
  • --dataloader_num_workers (int): Number of data loader workers (default: 4).

Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

Example Dataset
[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]

TODO

  • Add support for DeepSpeed ZeRO-3.
  • Add support for FSDP
  • Add support for simultaneously finetuning img_projector
  • Add support for full finetuning
  • Add support for grounded finetuning
  • Add support for multi-image finetuning
  • More advanced PEFT method (e.g., DoRA)
  • FSDP with ActivationCheckpointing Wrapper
  • Intergration with Chuanhu Chat

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

This project borrowed code from LLaVA and Microsoft Phi-3-vision-128k-instruct. Thanks to both projects for their contributions.

Citation

If you use this codebase in your work, please cite this project:

@misc{phi3vfinetuning2023,
  author = {Gai Zhenbiao & Shao Zhenwei},
  title = {Phi3V-Finetuning},
  year = {2023},
  publisher = {GitHub},
  url = {https://github.com/GaiZhenbiao/Phi3V-Finetuning},
  note = {GitHub repository},
}