LLaMA-Factory: A Python repository from Kimizhao

👋 Join our WeChat.

[ English | 中文 ]

LLaMA Board: A One-stop Web UI for Getting Started with LLaMA Factory

Preview LLaMA Board at 🤗 Spaces or ModelScope.

Launch LLaMA Board via CUDA_VISIBLE_DEVICES=0 python src/train_web.py. (multiple GPUs are not supported yet in this mode)

Here is an example of altering the self-cognition of an instruction-tuned language model within 10 minutes on a single GPU.

tutorial.mp4

Benchmark
Changelog
Supported Models
Supported Training Approaches
Provided Datasets
Requirement
Getting Started
Projects using LLaMA Factory
License
Citation
Acknowledgement

Benchmark

Compared to ChatGLM's P-Tuning, LLaMA-Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA-Factory's QLoRA further improves the efficiency regarding the GPU memory.

Definitions

Training Speed: the number of training samples processed per second during the training. (bs=4, cutoff_len=1024)
Rouge Score: Rouge-2 score on the development set of the advertising text generation task. (bs=4, cutoff_len=1024)
GPU Memory: Peak GPU memory usage in 4-bit quantized training. (bs=1, cutoff_len=1024)
We adopt pre_seq_len=128 for ChatGLM's P-Tuning and lora_rank=32 for LLaMA-Factory's LoRA tuning.

Changelog

[24/02/15] We supported block expansion proposed by LLaMA Pro. See tests/llama_pro.py for usage.

[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.

[24/01/18] We supported agent tuning for most models, equipping model with tool using abilities by fine-tuning with --dataset glaive_toolcall.

Full Changelog

[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try --use_unsloth argument to activate unsloth patch. It achieves 1.7x speed in our benchmark, check this page for details.

[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.

[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub for Chinese mainland users. See this tutorial for usage.

[23/10/21] We supported NEFTune trick for fine-tuning. Try --neftune_noise_alpha argument to activate NEFTune, e.g., --neftune_noise_alpha 5.

[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try --shift_attn argument to enable shift short attention.

[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See this example to evaluate your models.

[23/09/10] We supported FlashAttention-2. Try --flash_attn argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.

[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try --rope_scaling linear argument in training and --rope_scaling dynamic argument at inference to extrapolate the position embeddings.

[23/08/11] We supported DPO training for instruction-tuned models. See this example to train your models.

[23/07/31] We supported dataset streaming. Try --streaming and --max_steps 10000 arguments to load your dataset in streaming mode.

[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.

[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py to fine-tune models in your Web browser. Thank @KanadeSiina and @codemayq for their efforts in the development.

[23/07/09] We released FastEdit ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.

[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.

[23/06/22] We aligned the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.

[23/06/03] We supported quantized training and inference (aka QLoRA). Try --quantization_bit 4/8 argument to work with quantized models.

Supported Models

Model	Model size	Default module	Template
Baichuan2	7B/13B	W_pack	baichuan2
BLOOM	560M/1.1B/1.7B/3B/7.1B/176B	query_key_value	-
BLOOMZ	560M/1.1B/1.7B/3B/7.1B/176B	query_key_value	-
ChatGLM3	6B	query_key_value	chatglm3
DeepSeek (MoE)	7B/16B/67B	q_proj,v_proj	deepseek
Falcon	7B/40B/180B	query_key_value	falcon
Gemma	2B/7B	q_proj,v_proj	gemma
InternLM2	7B/20B	wqkv	intern2
LLaMA	7B/13B/33B/65B	q_proj,v_proj	-
LLaMA-2	7B/13B/70B	q_proj,v_proj	llama2
Mistral	7B	q_proj,v_proj	mistral
Mixtral	8x7B	q_proj,v_proj	mistral
Phi-1.5/2	1.3B/2.7B	q_proj,v_proj	-
Qwen	1.8B/7B/14B/72B	c_attn	qwen
Qwen1.5	0.5B/1.8B/4B/7B/14B/72B	q_proj,v_proj	qwen
XVERSE	7B/13B/65B	q_proj,v_proj	xverse
Yi	6B/34B	q_proj,v_proj	yi
Yuan	2B/51B/102B	q_proj,v_proj	yuan

Note

Default module is used for the --lora_target argument, you can use --lora_target all to specify all the available modules.

For the "base" models, the --template argument can be chosen from default, alpaca, vicuna etc. But make sure to use the corresponding template for the "chat" models.

Please refer to constants.py for a full list of models we supported.

Supported Training Approaches

Approach	Full-parameter	Partial-parameter	LoRA	QLoRA
Pre-Training	✅	✅	✅	✅
Supervised Fine-Tuning	✅	✅	✅	✅
Reward Modeling	✅	✅	✅	✅
PPO Training	✅	✅	✅	✅
DPO Training	✅	✅	✅	✅

Note

Use --quantization_bit 4 argument to enable QLoRA.

Provided Datasets

Pre-training datasets

Supervised fine-tuning datasets

Preference datasets

Please refer to data/README.md for details.

Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

pip install --upgrade huggingface_hub
huggingface-cli login

Requirement

Python 3.8+ and PyTorch 1.13.1+
🤗Transformers, Datasets, Accelerate, PEFT and TRL
sentencepiece, protobuf and tiktoken
jieba, rouge-chinese and nltk (used at evaluation and predict)
gradio and matplotlib (used in web UI)
uvicorn, fastapi and sse-starlette (used in API)

Hardware Requirement

Method	Bits	7B	13B	30B	65B	8x7B
Full	16	160GB	320GB	600GB	1200GB	900GB
Freeze	16	20GB	40GB	120GB	240GB	200GB
LoRA	16	16GB	32GB	80GB	160GB	120GB
QLoRA	8	10GB	16GB	40GB	80GB	80GB
QLoRA	4	6GB	12GB	24GB	48GB	32GB

Getting Started

Data Preparation (optional)

Please refer to data/README.md for checking the details about the format of dataset files. You can either use a single .json file or a dataset loading script with multiple files to create a custom dataset.

Note

Please update data/dataset_info.json to use your custom dataset. About the format of this file, please refer to data/README.md.

Dependence Installation (optional)

git clone https://github.com/hiyouga/LLaMA-Factory.git
conda create -n llama_factory python=3.10
conda activate llama_factory
cd LLaMA-Factory
pip install -r requirements.txt

If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you will be required to install a pre-built version of bitsandbytes library, which supports CUDA 11.1 to 12.2.

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.0-py3-none-win_amd64.whl

To enable FlashAttention-2 on the Windows platform, you need to install the precompiled flash-attn library, which supports CUDA 12.1 to 12.2. Please download the corresponding version from flash-attention based on your requirements.

Use ModelScope Hub (optional)

If you have trouble with downloading models and datasets from Hugging Face, you can use LLaMA-Factory together with ModelScope in the following manner.

export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows

Then you can train the corresponding model by specifying a model ID of the ModelScope Hub. (find a full list of model IDs at ModelScope Hub)

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --model_name_or_path modelscope/Llama-2-7b-ms \
    ... # arguments (same as above)

LLaMA Board also supports using the models and datasets on the ModelScope Hub.

CUDA_VISIBLE_DEVICES=0 USE_MODELSCOPE_HUB=1 python src/train_web.py

Train on a single GPU

Important

If you want to train models on multiple GPUs, please refer to Distributed Training.

Pre-Training

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

Supervised Fine-Tuning

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --dataset alpaca_gpt4_en \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

Reward Modeling

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage rm \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset comparison_gpt4_en \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_rm_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

PPO Training

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage ppo \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset alpaca_gpt4_en \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --reward_model path_to_rm_checkpoint \
    --output_dir path_to_ppo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --top_k 0 \
    --top_p 0.9 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

Warning

Use --per_device_train_batch_size=1 for LLaMA-2 models in fp16 PPO training.

DPO Training

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage dpo \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset comparison_gpt4_en \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_dpo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

Distributed Training

Use Huggingface Accelerate

accelerate config # configure the environment
accelerate launch src/train_bash.py # arguments (same as above)

Example config for LoRA training

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Use DeepSpeed

deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    ... # arguments (same as above)

Example config for full-parameter training with DeepSpeed ZeRO-2

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}

Merge LoRA weights and export model

python src/export_model.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora \
    --export_dir path_to_export \
    --export_size 2 \
    --export_legacy_format False

Warning

Merging LoRA weights into a quantized model is not supported.

Tip

Use --export_quantization_bit 4 and --export_quantization_dataset data/c4_demo.json to quantize the model after merging the LoRA weights.

Inference with OpenAI-style API

python src/api_demo.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora

Tip

Visit http://localhost:8000/docs for API documentation.

Inference with command line

python src/cli_demo.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora

Inference with web browser

python src/web_demo.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora

Evaluation

CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template vanilla \
    --finetuning_type lora \
    --task mmlu \
    --split test \
    --lang en \
    --n_shot 5 \
    --batch_size 4

Predict

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_predict \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --dataset alpaca_gpt4_en \
    --template default \
    --finetuning_type lora \
    --output_dir path_to_predict_result \
    --per_device_eval_batch_size 1 \
    --max_samples 100 \
    --predict_with_generate \
    --fp16

Warning

Use --per_device_train_batch_size=1 for LLaMA-2 models in fp16 predict.

Tip

We recommend using --per_device_eval_batch_size=1 and --max_target_length 128 at 4/8-bit predict.

Projects using LLaMA Factory

Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [arxiv]
Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [arxiv]
Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [arxiv]
Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [arxiv]
Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. 2024. [arxiv]
Wang et al. CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning. 2024. [arxiv]
Choi et al. FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs. 2024. [arxiv]
Zhang et al. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. 2024. [arxiv]
Lyu et al. KnowTuning: Knowledge-aware Fine-tuning for Large Language Models. 2024. [arxiv]
Yang et al. LaCo: Large Language Model Pruning via Layer Collaps. 2024. [arxiv]
Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. 2024. [arxiv]
Yang et al. Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. 2024. [arxiv]
Yi et al. Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. 2024. [arxiv]
Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [arxiv]
Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [arxiv]
Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [arxiv]
StarWhisper: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
DISC-LawLLM: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
Sunsimiao: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
CareGPT: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
MachineMindset: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.

Tip

If you have a project that should be incorporated, please contact via email or create a pull request.

License

This repository is licensed under the Apache-2.0 License.

Please follow the model licenses to use the corresponding model weights: Baichuan2 / BLOOM / ChatGLM3 / DeepSeek / Falcon / Gemma / InternLM2 / LLaMA / LLaMA-2 / Mistral / Phi-1.5/2 / Qwen / XVERSE / Yi / Yuan

Citation

If this work is helpful, please kindly cite as:

@Misc{llama-factory,
  title = {LLaMA Factory},
  author = {hiyouga},
  howpublished = {\url{https://github.com/hiyouga/LLaMA-Factory}},
  year = {2023}
}

Acknowledgement

This repo benefits from PEFT, QLoRA and FastChat. Thanks for their wonderful works.

Kimizhao/LLaMA-Factory

LLaMA Board: A One-stop Web UI for Getting Started with LLaMA Factory

Table of Contents

Benchmark

Changelog

Supported Models

Supported Training Approaches

Provided Datasets

Requirement

Hardware Requirement

Getting Started

Data Preparation (optional)

Dependence Installation (optional)

Use ModelScope Hub (optional)

Train on a single GPU

Pre-Training

Supervised Fine-Tuning

Reward Modeling

PPO Training

DPO Training

Distributed Training

Use Huggingface Accelerate

Use DeepSpeed

Merge LoRA weights and export model

Inference with OpenAI-style API

Inference with command line

Inference with web browser

Evaluation

Predict

Projects using LLaMA Factory

License

Citation

Acknowledgement

Star History