/TODO-LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

Primary LanguagePythonApache License 2.0Apache-2.0

# LLaMA Factory

GitHub Repo stars GitHub Code License GitHub last commit PyPI Citation GitHub pull request Discord Twitter Spaces Studios Open in Colab

GitHub Tread

πŸ‘‹ Join our WeChat.

[ English | δΈ­ζ–‡ ]

Fine-tuning a large language model can be easy as...

tutorial_en.mp4

Choose your path:

Table of Contents

Features

  • Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
  • Integrated methods: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO and ORPO.
  • Scalable resources: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA and 2/4/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8.
  • Advanced algorithms: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and Agent tuning.
  • Practical tricks: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
  • Experiment monitors: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
  • Faster inference: OpenAI-style API, Gradio UI and CLI with vLLM worker.

Benchmark

Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA Factory's QLoRA further improves the efficiency regarding the GPU memory.

benchmark

Definitions
  • Training Speed: the number of training samples processed per second during the training. (bs=4, cutoff_len=1024)
  • Rouge Score: Rouge-2 score on the development set of the advertising text generation task. (bs=4, cutoff_len=1024)
  • GPU Memory: Peak GPU memory usage in 4-bit quantized training. (bs=1, cutoff_len=1024)
  • We adopt pre_seq_len=128 for ChatGLM's P-Tuning and lora_rank=32 for LLaMA Factory's LoRA tuning.

Changelog

[24/05/20] We supported fine-tuning the PaliGemma series models. Note that the PaliGemma models are pre-trained models, you need to fine-tune them with gemma template for chat completion.

[24/05/18] We supported KTO algorithm for preference learning. See examples for usage.

[24/05/14] We supported training and inference on the Ascend NPU devices. Check installation section for details.

Full Changelog

[24/04/26] We supported fine-tuning the LLaVA-1.5 multimodal LLMs. See examples for usage.

[24/04/22] We provided a Colab notebook for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check Llama3-8B-Chinese-Chat and Llama3-Chinese for details.

[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples for usage.

[24/04/16] We supported BAdam. See examples for usage.

[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It achieves 117% speed and 50% memory compared with FlashAttention-2, more benchmarks can be found in this page.

[24/03/31] We supported ORPO. See examples for usage.

[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!

[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples for usage.

[24/03/13] We supported LoRA+. See examples for usage.

[24/03/07] We supported gradient low-rank projection (GaLore) algorithm. See examples for usage.

[24/03/07] We integrated vLLM for faster and concurrent inference. Try infer_backend: vllm to enjoy 270% inference speed.

[24/02/28] We supported weight-decomposed LoRA (DoRA). Try use_dora: true to activate DoRA training.

[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples for usage.

[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.

[24/01/18] We supported agent tuning for most models, equipping model with tool using abilities by fine-tuning with dataset: glaive_toolcall.

[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try use_unsloth: true argument to activate unsloth patch. It achieves 170% speed in our benchmark, check this page for details.

[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.

[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub for Chinese mainland users. See this tutorial for usage.

[23/10/21] We supported NEFTune trick for fine-tuning. Try neftune_noise_alpha: 5 argument to activate NEFTune.

[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try shift_attn: true argument to enable shift short attention.

[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See examples for usage.

[23/09/10] We supported FlashAttention-2. Try flash_attn: fa2 argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.

[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try rope_scaling: linear argument in training and rope_scaling: dynamic argument at inference to extrapolate the position embeddings.

[23/08/11] We supported DPO training for instruction-tuned models. See examples for usage.

[23/07/31] We supported dataset streaming. Try streaming: true and max_steps: 10000 arguments to load your dataset in streaming mode.

[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.

[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py to fine-tune models in your Web browser. Thank @KanadeSiina and @codemayq for their efforts in the development.

[23/07/09] We released FastEdit ⚑🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.

[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.

[23/06/22] We aligned the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.

[23/06/03] We supported quantized training and inference (aka QLoRA). See examples for usage.

Supported Models

Model Model size Default module Template
Baichuan2 7B/13B W_pack baichuan2
BLOOM 560M/1.1B/1.7B/3B/7.1B/176B query_key_value -
BLOOMZ 560M/1.1B/1.7B/3B/7.1B/176B query_key_value -
ChatGLM3 6B query_key_value chatglm3
Command-R 35B/104B q_proj,v_proj cohere
DeepSeek (MoE) 7B/16B/67B/236B q_proj,v_proj deepseek
Falcon 7B/11B/40B/180B query_key_value falcon
Gemma/CodeGemma 2B/7B q_proj,v_proj gemma
InternLM2 7B/20B wqkv intern2
LLaMA 7B/13B/33B/65B q_proj,v_proj -
LLaMA-2 7B/13B/70B q_proj,v_proj llama2
LLaMA-3 8B/70B q_proj,v_proj llama3
LLaVA-1.5 7B/13B q_proj,v_proj vicuna
Mistral/Mixtral 7B/8x7B/8x22B q_proj,v_proj mistral
OLMo 1B/7B q_proj,v_proj -
PaliGemma 3B q_proj,v_proj gemma
Phi-1.5/2 1.3B/2.7B q_proj,v_proj -
Phi-3 3.8B qkv_proj phi
Qwen 1.8B/7B/14B/72B c_attn qwen
Qwen1.5 (Code/MoE) 0.5B/1.8B/4B/7B/14B/32B/72B/110B q_proj,v_proj qwen
StarCoder2 3B/7B/15B q_proj,v_proj -
XVERSE 7B/13B/65B q_proj,v_proj xverse
Yi (1/1.5) 6B/9B/34B q_proj,v_proj yi
Yi-VL 6B/34B q_proj,v_proj yi_vl
Yuan 2B/51B/102B q_proj,v_proj yuan

Note

Default module is used for the --lora_target argument, you can use --lora_target all to specify all the available modules for better convergence.

For the "base" models, the --template argument can be chosen from default, alpaca, vicuna etc. But make sure to use the corresponding template for the "instruct/chat" models.

Remember to use the SAME template in training and inference.

Please refer to constants.py for a full list of models we supported.

You also can add a custom chat template to template.py.

Supported Training Approaches

Approach Full-tuning Freeze-tuning LoRA QLoRA
Pre-Training βœ… βœ… βœ… βœ…
Supervised Fine-Tuning βœ… βœ… βœ… βœ…
Reward Modeling βœ… βœ… βœ… βœ…
PPO Training βœ… βœ… βœ… βœ…
DPO Training βœ… βœ… βœ… βœ…
KTO Training βœ… βœ… βœ… βœ…
ORPO Training βœ… βœ… βœ… βœ…

Provided Datasets

Pre-training datasets
Supervised fine-tuning datasets
Preference datasets

Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

pip install --upgrade huggingface_hub
huggingface-cli login

Requirement

Mandatory Minimum Recommend
python 3.8 3.10
torch 1.13.1 2.2.0
transformers 4.37.2 4.41.0
datasets 2.14.3 2.19.1
accelerate 0.27.2 0.30.1
peft 0.9.0 0.11.1
trl 0.8.2 0.8.6
Optional Minimum Recommend
CUDA 11.6 12.2
deepspeed 0.10.0 0.14.0
bitsandbytes 0.39.0 0.43.1
vllm 0.4.0 0.4.2
flash-attn 2.3.0 2.5.8

Hardware Requirement

* estimated

Method Bits 7B 13B 30B 70B 110B 8x7B 8x22B
Full AMP 120GB 240GB 600GB 1200GB 2000GB 900GB 2400GB
Full 16 60GB 120GB 300GB 600GB 900GB 400GB 1200GB
Freeze 16 20GB 40GB 80GB 200GB 360GB 160GB 400GB
LoRA/GaLore/BAdam 16 16GB 32GB 64GB 160GB 240GB 120GB 320GB
QLoRA 8 10GB 20GB 40GB 80GB 140GB 60GB 160GB
QLoRA 4 6GB 12GB 24GB 48GB 72GB 30GB 96GB
QLoRA 2 4GB 8GB 16GB 24GB 48GB 18GB 48GB

Getting Started

Installation

Important

Installation is mandatory.

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .[torch,metrics]

Extra dependencies available: torch, metrics, deepspeed, bitsandbytes, vllm, galore, badam, gptq, awq, aqlm, qwen, modelscope, quality

Tip

Use pip install --no-deps -e . to resolve package conflicts.

For Windows users

If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you need to install a pre-built version of bitsandbytes library, which supports CUDA 11.1 to 12.2, please select the appropriate release version based on your CUDA version.

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl

To enable FlashAttention-2 on the Windows platform, you need to install the precompiled flash-attn library, which supports CUDA 12.1 to 12.2. Please download the corresponding version from flash-attention based on your requirements.

For Ascend NPU users

Join NPU user group.

To utilize Ascend NPU devices for (distributed) training and inference, you need to install the torch-npu library and the Ascend CANN Kernels.

Requirement Minimum Recommend
CANN 8.0.RC1 8.0.RC1
torch 2.2.0 2.2.0
torch-npu 2.2.0 2.2.0
deepspeed 0.13.2 0.13.2

Docker image:

Remember to use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify the device to use.

If you cannot infer model on NPU devices, try setting do_sample: false in the configurations.

Data Preparation

Please refer to data/README.md for checking the details about the format of dataset files. You can either use datasets on HuggingFace / ModelScope hub or load the dataset in local disk.

Note

Please update data/dataset_info.json to use your custom dataset.

Quickstart

Use the following 3 commands to run LoRA fine-tuning, inference and merging of the Llama3-8B-Instruct model, respectively.

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml

See examples/README.md for advanced usage (including distributed training).

Tip

Use llamafactory-cli help to show help information.

Fine-Tuning with LLaMA Board GUI (powered by Gradio)

Important

LLaMA Board GUI only supports training on a single GPU.

Use local environment

CUDA_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
For Alibaba Cloud PAI or AutoDL users

If you encountered display problems in LLaMA Board on Alibaba Cloud PAI, try using the following command to set environment variables before starting LLaMA Board:

export GRADIO_SERVER_PORT=7860 GRADIO_ROOT_PATH=/${JUPYTER_NAME}/proxy/7860/

If you are using AutoDL, please install a specific version of Gradio:

pip install gradio==4.10.0

Use Docker

docker build -f ./Dockerfile -t llama-factory:latest .
docker run --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface/ \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -e CUDA_VISIBLE_DEVICES=0 \
    -p 7860:7860 \
    --shm-size 16G \
    --name llama_factory \
    -d llama-factory:latest

Use Docker Compose

docker compose -f ./docker-compose.yml up -d
Details about volume
  • hf_cache: Utilize Hugging Face cache on the host machine. Reassignable if a cache already exists in a different directory.
  • data: Place datasets on this dir of the host machine so that they can be selected on LLaMA Board GUI.
  • output: Set export dir to this location so that the merged result can be accessed directly on the host machine.

Deploy with OpenAI-style API and vLLM

CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml

Download from ModelScope Hub

If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.

export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows

Train the model by specifying a model ID of the ModelScope Hub as the --model_name_or_path. You can find a full list of model IDs at ModelScope Hub, e.g., LLM-Research/Meta-Llama-3-8B-Instruct.

Projects using LLaMA Factory

If you have a project that should be incorporated, please contact via email or create a pull request.

Click to show
  1. Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [arxiv]
  2. Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [arxiv]
  3. Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. [arxiv]
  4. Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [arxiv]
  5. Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [arxiv]
  6. Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. 2024. [arxiv]
  7. Wang et al. CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning. 2024. [arxiv]
  8. Choi et al. FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs. 2024. [arxiv]
  9. Zhang et al. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. 2024. [arxiv]
  10. Lyu et al. KnowTuning: Knowledge-aware Fine-tuning for Large Language Models. 2024. [arxiv]
  11. Yang et al. LaCo: Large Language Model Pruning via Layer Collaps. 2024. [arxiv]
  12. Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. 2024. [arxiv]
  13. Yang et al. Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. 2024. [arxiv]
  14. Yi et al. Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. 2024. [arxiv]
  15. Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [arxiv]
  16. Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [arxiv]
  17. Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [arxiv]
  18. Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. 2024. [arxiv]
  19. Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [arxiv]
  20. Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [arxiv]
  21. Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [arxiv]
  22. Wu et al. Large Language Models are Parallel Multilingual Learners. 2024. [arxiv]
  23. Zhang et al. EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling. 2024. [arxiv]
  24. Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [arxiv]
  25. Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. 2024. [arxiv]
  26. Zan et al. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. 2024. [arxiv]
  27. Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. [arxiv]
  28. Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. [arxiv]
  29. Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. [arxiv]
  30. Ma et al. Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation. 2024. [arxiv]
  31. Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. [arxiv]
  32. Shang et al. How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models. 2024. [arxiv]
  33. Huang et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. 2024. [arxiv]
  34. Deng et al. Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction. 2024. [arxiv]
  35. Acikgoz et al. Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare. 2024. [arxiv]
  36. Zhang et al. Small Language Models Need Strong Verifiers to Self-Correct Reasoning. 2024. [arxiv]
  37. Zhou et al. FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering. 2024. [arxiv]
  38. StarWhisper: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
  39. DISC-LawLLM: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
  40. Sunsimiao: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
  41. CareGPT: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
  42. MachineMindset: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.
  43. Luminia-13B-v3: A large language model specialized in generate metadata for stable diffusion. [πŸ€—Demo]
  44. Chinese-LLaVA-Med: A multimodal large language model specialized in Chinese medical domain, based on LLaVA-1.5-7B.

License

This repository is licensed under the Apache-2.0 License.

Please follow the model licenses to use the corresponding model weights: Baichuan2 / BLOOM / ChatGLM3 / Command-R / DeepSeek / Falcon / Gemma / InternLM2 / LLaMA / LLaMA-2 (LLaVA-1.5) / LLaMA-3 / Mistral / OLMo / Phi-1.5/2 / Phi-3 / Qwen / StarCoder2 / XVERSE / Yi / Yi-1.5 / Yuan

Citation

If this work is helpful, please kindly cite as:

@article{zheng2024llamafactory,
  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Yongqiang Ma},
  journal={arXiv preprint arXiv:2403.13372},
  year={2024},
  url={http://arxiv.org/abs/2403.13372}
}

Acknowledgement

This repo benefits from PEFT, TRL, QLoRA and FastChat. Thanks for their wonderful works.

Star History

Star History Chart