LLamaTuner

Easy and efficient fine-tuning of LLMs (supports LLaMA, LLaMA 2, LLaMA 3, Qwen, Baichuan, GLM, Falcon), with efficient quantized training and deployment of large models.


👋🤗🤗👋 Join our WeChat.


Introduction

LLamaTuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs (Llama 3, Phi-3, Qwen, Mistral, ...).

Efficient

  • Support LLM and VLM pre-training and fine-tuning on almost all GPUs. LLamaTuner can fine-tune a 7B LLM on a single 8GB GPU and also supports multi-node fine-tuning of models exceeding 70B.
  • Automatically dispatch high-performance operators such as FlashAttention and Triton kernels to increase training throughput.
  • Compatible with DeepSpeed 🚀, easily utilizing a variety of ZeRO optimization techniques.

Flexible

  • Support various LLMs (Llama 3, Mixtral, Llama 2, ChatGLM, Qwen, Baichuan, ...).
  • Support VLM (LLaVA).
  • Well-designed data pipeline, accommodating datasets in any format, including but not limited to open-source and custom formats.
  • Support various training algorithms (QLoRA, LoRA, full-parameter fine-tuning), allowing users to choose the most suitable solution for their requirements.

Full-featured

  • Support continuous pre-training, instruction fine-tuning, and agent fine-tuning.
  • Support chatting with large models with pre-defined templates.

Supported Models

| Model | Model size | Default module | Template |
|---|---|---|---|
| Baichuan | 7B/13B | W_pack | baichuan |
| Baichuan2 | 7B/13B | W_pack | baichuan2 |
| BLOOM | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
| BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
| ChatGLM3 | 6B | query_key_value | chatglm3 |
| Command-R | 35B/104B | q_proj,v_proj | cohere |
| DeepSeek (MoE) | 7B/16B/67B/236B | q_proj,v_proj | deepseek |
| Falcon | 7B/11B/40B/180B | query_key_value | falcon |
| Gemma/CodeGemma | 2B/7B | q_proj,v_proj | gemma |
| InternLM2 | 7B/20B | wqkv | intern2 |
| LLaMA | 7B/13B/33B/65B | q_proj,v_proj | - |
| LLaMA-2 | 7B/13B/70B | q_proj,v_proj | llama2 |
| LLaMA-3 | 8B/70B | q_proj,v_proj | llama3 |
| LLaVA-1.5 | 7B/13B | q_proj,v_proj | vicuna |
| Mistral/Mixtral | 7B/8x7B/8x22B | q_proj,v_proj | mistral |
| OLMo | 1B/7B | q_proj,v_proj | - |
| PaliGemma | 3B | q_proj,v_proj | gemma |
| Phi-1.5/2 | 1.3B/2.7B | q_proj,v_proj | - |
| Phi-3 | 3.8B | qkv_proj | phi |
| Qwen | 1.8B/7B/14B/72B | c_attn | qwen |
| Qwen1.5 (Code/MoE) | 0.5B/1.8B/4B/7B/14B/32B/72B/110B | q_proj,v_proj | qwen |
| StarCoder2 | 3B/7B/15B | q_proj,v_proj | - |
| XVERSE | 7B/13B/65B | q_proj,v_proj | xverse |
| Yi (1/1.5) | 6B/9B/34B | q_proj,v_proj | yi |
| Yi-VL | 6B/34B | q_proj,v_proj | yi_vl |
| Yuan | 2B/51B/102B | q_proj,v_proj | yuan |
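
As a rough illustration of how the table can be used programmatically, the sketch below copies a handful of rows into a dictionary and looks up the default module and template for a model. The names `SUPPORTED_MODELS`, `default_modules`, and `template_for` are hypothetical helpers, not part of LLamaTuner, and it is an assumption that the comma in entries such as `q_proj,v_proj` separates individual module names.

```python
from typing import List, Optional

# A few rows copied from the table above: model -> (default module, template).
# A "-" in the Template column is represented as None.
SUPPORTED_MODELS = {
    "Baichuan2": ("W_pack", "baichuan2"),
    "ChatGLM3": ("query_key_value", "chatglm3"),
    "LLaMA-2": ("q_proj,v_proj", "llama2"),
    "LLaMA-3": ("q_proj,v_proj", "llama3"),
    "Phi-3": ("qkv_proj", "phi"),
    "Qwen": ("c_attn", "qwen"),
    "StarCoder2": ("q_proj,v_proj", None),
}


def default_modules(model: str) -> List[str]:
    """Return the default module names for a model as a list of strings."""
    modules, _template = SUPPORTED_MODELS[model]
    return modules.split(",")


def template_for(model: str) -> Optional[str]:
    """Return the template name for a model, or None if the table lists none."""
    _modules, template = SUPPORTED_MODELS[model]
    return template


print(default_modules("LLaMA-3"))
print(template_for("StarCoder2"))
```

Keeping the mapping in one place makes it easy to fail early with a `KeyError` when a model name is not in the table.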

Supported Training Approaches

Approach Full-tuning Freeze-tuning LoRA QLoRA
Pre-Training
Supervised Fine-Tuning
Reward Modeling
PPO Training
DPO Training
KTO Training
ORPO Training

Supported Datasets

As of now, we support the following datasets, most of which are available in the Hugging Face datasets library.

Supervised fine-tuning dataset
Preference datasets

Please refer to data/README.md to learn how to use these datasets. If you want to explore more datasets, please refer to the awesome-instruction-datasets. Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

pip install --upgrade huggingface_hub
huggingface-cli login

Data Preprocessing

We provide a number of data preprocessing tools in the data folder. These tools are intended to be a starting point for further research and development.

Model Zoo

We publish a number of models on the Hugging Face model hub. These models are trained with QLoRA and can be used for inference and further fine-tuning:

| Base Model | Adapter | Instruct Datasets | Train Script | Log | Model on Huggingface |
|---|---|---|---|---|---|
| llama-7b | FullFinetune | - | - | - | |
| llama-7b | QLoRA | openassistant-guanaco | finetune_lamma7b | wandb log | GaussianTech/llama-7b-sft |
| llama-7b | QLoRA | OL-CC | finetune_lamma7b | | |
| baichuan7b | QLoRA | openassistant-guanaco | finetune_baichuan7b | wandb log | GaussianTech/baichuan-7b-sft |
| baichuan7b | QLoRA | OL-CC | finetune_baichuan7b | wandb log | - |

Requirement

| Mandatory | Minimum | Recommend |
|---|---|---|
| python | 3.8 | 3.10 |
| torch | 1.13.1 | 2.2.0 |
| transformers | 4.37.2 | 4.41.0 |
| datasets | 2.14.3 | 2.19.1 |
| accelerate | 0.27.2 | 0.30.1 |
| peft | 0.9.0 | 0.11.1 |
| trl | 0.8.2 | 0.8.6 |

| Optional | Minimum | Recommend |
|---|---|---|
| CUDA | 11.6 | 12.2 |
| deepspeed | 0.10.0 | 0.14.0 |
| bitsandbytes | 0.39.0 | 0.43.1 |
| vllm | 0.4.0 | 0.4.2 |
| flash-attn | 2.3.0 | 2.5.8 |
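
One way to check an environment against the mandatory minimums is sketched below. It uses only the standard library (`importlib.metadata`); `MINIMUM_VERSIONS`, `parse_version`, and `check_requirements` are hypothetical names introduced for this example, and the version parser is deliberately simple (it compares only the leading numeric components, so `packaging.version` would be a more robust choice where it is available).

```python
import re
from importlib import metadata

# Mandatory minimum versions copied from the table above.
MINIMUM_VERSIONS = {
    "torch": "1.13.1",
    "transformers": "4.37.2",
    "datasets": "2.14.3",
    "accelerate": "0.27.2",
    "peft": "0.9.0",
    "trl": "0.8.2",
}


def parse_version(text):
    """Turn a string such as '2.2.0+cu121' into the tuple (2, 2, 0)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a dict mapping each package to 'ok', 'missing', or a 'too old' note."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old (found {installed}, need >= {minimum})"
    return report
```

Calling `check_requirements()` and printing the resulting dictionary gives a quick per-package summary before starting a training run.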

Hardware Requirement

* estimated

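| Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
|---|---|---|---|---|---|---|---|---|
| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
| Freeze | 16 | 20GB | 40GB | 80GB | 200GB | 360GB | 160GB | 400GB |
| LoRA/GaLore/BAdam | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |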
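
To pick a method for a given memory budget, the estimates can be treated as a lookup table. The sketch below copies three columns (7B, 13B, 70B) from the table above; `ESTIMATED_GB` and `methods_that_fit` are hypothetical names, and the function simply compares the estimate with the total memory you have available, which is an assumption about how the figures are meant to be read.

```python
# Estimated memory in GB, copied from three columns of the table above.
# Keys are (method, bits); values map model size -> estimated GB.
ESTIMATED_GB = {
    ("Full", "AMP"): {"7B": 120, "13B": 240, "70B": 1200},
    ("Full", "16"): {"7B": 60, "13B": 120, "70B": 600},
    ("Freeze", "16"): {"7B": 20, "13B": 40, "70B": 200},
    ("LoRA/GaLore/BAdam", "16"): {"7B": 16, "13B": 32, "70B": 160},
    ("QLoRA", "8"): {"7B": 10, "13B": 20, "70B": 80},
    ("QLoRA", "4"): {"7B": 6, "13B": 12, "70B": 48},
    ("QLoRA", "2"): {"7B": 4, "13B": 8, "70B": 24},
}


def methods_that_fit(model_size, available_gb):
    """List the (method, bits) pairs whose estimate fits within available_gb."""
    return [
        key
        for key, sizes in ESTIMATED_GB.items()
        if sizes[model_size] <= available_gb
    ]


# With 8 GB available, only 4-bit and 2-bit QLoRA fit a 7B model.
print(methods_that_fit("7B", 8))
```

This matches the claim in the introduction that a 7B model can be fine-tuned on a single 8GB GPU: per the table, only 4-bit and 2-bit QLoRA fall within that budget.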

Getting Started

Clone the code

Clone this repository and navigate to the LLamaTuner folder:

git clone https://github.com/jianzhnie/LLamaTuner.git
cd LLamaTuner

Training Scripts

| Main function | Usage | Scripts |
|---|---|---|
| train.py | Full fine-tuning of LLMs on SFT datasets | full_finetune |
| train_lora.py | Fine-tuning LLMs with LoRA (Low-Rank Adaptation of Large Language Models) | lora_finetune |
| train_qlora.py | Fine-tuning LLMs with QLoRA (QLoRA: Efficient Finetuning of Quantized LLMs) | qlora_finetune |

QLoRA int4 Fine-tuning

The train_qlora.py code is a starting point for finetuning and inference on various datasets. Basic command for finetuning a baseline model on the Alpaca dataset:

python train_qlora.py --model_name_or_path <path_or_name>

For models larger than 13B, we recommend adjusting the learning rate:

python train_qlora.py --learning_rate 0.0001 --model_name_or_path <path_or_name>
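
When launching several runs, it can be convenient to assemble the commands above from Python. The sketch below builds the argument list using only the two flags shown in this section; `build_qlora_command` is a hypothetical helper and `path/to/model` is a placeholder, not a real model path.

```python
import shlex


def build_qlora_command(model_name_or_path, learning_rate=None):
    """Assemble the train_qlora.py command line as a list of arguments."""
    command = [
        "python",
        "train_qlora.py",
        "--model_name_or_path",
        model_name_or_path,
    ]
    # Only pass --learning_rate when the caller overrides the script's default.
    if learning_rate is not None:
        command += ["--learning_rate", str(learning_rate)]
    return command


# Placeholder model path; replace it with a real path or model name.
command = build_qlora_command("path/to/model", learning_rate=0.0001)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root to start training.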

To find more scripts for finetuning and inference, please refer to the scripts folder.

Known Issues and Limitations

Here is a list of known issues and bugs. If your issue is not reported here, please open a new issue and describe the problem.

  1. 4-bit inference is slow. Currently, our 4-bit inference implementation is not yet integrated with 4-bit matrix multiplication.
  2. Resuming a LoRA training run with the Trainer currently results in an error.
  3. Currently, using bnb_4bit_compute_type='fp16' can lead to instabilities. For 7B LLaMA, only 80% of finetuning runs complete without error. We have solutions, but they are not integrated yet into bitsandbytes.
  4. Make sure that tokenizer.bos_token_id = 1 to avoid generation issues.
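
The check in item 4 can be made explicit before generation. The sketch below is a minimal guard that works with any object exposing a `bos_token_id` attribute; `check_bos_token_id` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a real tokenizer so the example runs without extra dependencies.

```python
from types import SimpleNamespace

EXPECTED_BOS_TOKEN_ID = 1  # value recommended in item 4 above


def check_bos_token_id(tokenizer, expected=EXPECTED_BOS_TOKEN_ID):
    """Raise ValueError if the tokenizer's bos_token_id differs from expected."""
    actual = getattr(tokenizer, "bos_token_id", None)
    if actual != expected:
        raise ValueError(
            f"tokenizer.bos_token_id is {actual!r}, expected {expected}"
        )
    return tokenizer


# Stand-in object for illustration; in practice, pass the loaded tokenizer.
fake_tokenizer = SimpleNamespace(bos_token_id=1)
check_bos_token_id(fake_tokenizer)
```

Raising an error here surfaces a misconfigured tokenizer before any generation is attempted.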

License

LLamaTuner is released under the Apache 2.0 license.

Acknowledgements

We thank the Huggingface team, in particular Younes Belkada, for their support integrating QLoRA with PEFT and transformers libraries.

We appreciate the work by many open-source contributors, especially:

Some LLM fine-tuning repos

Citation

Please cite the repo if you use the data or code in this repo.

@misc{Chinese-Guanaco,
  author = {jianzhnie},
  title = {LLamaTuner: Easy and Efficient Fine-tuning LLMs},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/jianzhnie/LLamaTuner}},
}