Llama 2 Fine-tuning / Inference Recipes and Examples

The 'llama-recipes' repository is a companion to the Llama 2 model. The goal of this repository is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models. See steps for conversion of the model here.

Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide. More details can be found in our research paper as well. For downloading the models, follow the instructions on Llama 2 repo.

Quick start
Model Conversion
Fine-tuning
Inference
Repository Organization
License and Acceptable Use Policy

Quick Start

Llama 2 Jupyter Notebook: This jupyter notebook steps you through how to finetune a Llama 2 model on the text summarization task using the samsum. The notebook uses parameter efficient finetuning (PEFT) and int8 quantization to finetune a 7B on a single GPU like an A10 with 24GB gpu memory.

Note All the setting defined in config files can be passed as args through CLI when running the script, there is no need to change from config files directly.

Note In case need to run PEFT model with FSDP, please make sure to use the PyTorch Nightlies.

For more in depth information checkout the following:

Requirements

To run the examples, make sure to install the requirements using

# python 3.9 or higher recommended
pip install -r requirements.txt

Please note that the above requirements.txt will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.

Model conversion to Hugging Face

The recipes and notebooks in this folder are using the Llama 2 model definition provided by Hugging Face's transformers library.

Given that the original checkpoint resides under models/7B you can install all requirements and convert the checkpoint with:

## Install HuggingFace Transformers from source
pip freeze | grep transformers ## verify it is version 4.31.0 or higher

git clone git@github.com:huggingface/transformers.git
cd transformers
pip install protobuf
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
   --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path

Fine-tuning

For fine-tuning Llama 2 models for your domain-specific use cases recipes for PEFT, FSDP, PEFT+FSDP have been included along with a few test datasets. For details see LLM Fine-tuning.

Single and Multi GPU Finetune

If you want to dive right into single or multi GPU fine-tuning, run the examples below on a single GPU like A10, T4, V100, A100 etc. All the parameters in the examples and recipes below need to be further tuned to have desired results based on the model, method, data and task at hand.

Note:

To change the dataset in the commands below pass the dataset arg. Current options for dataset are grammar_dataset, alpaca_datasetand samsum_dataset. A description of the datasets and how to add custom datasets can be found in Dataset.md. For grammar_dataset, alpaca_dataset please make sure you use the suggested instructions from here to set them up.
Default dataset and other LORA config has been set to samsum_dataset.
Make sure to set the right path to the model in the training config.

Single GPU:

#if running on multi-gpu machine
export CUDA_VISIBLE_DEVICES=0

python llama_finetuning.py  --use_peft --peft_method lora --quantization --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model

Here we make use of Parameter Efficient Methods (PEFT) as described in the next section. To run the command above make sure to pass the peft_method arg which can be set to lora, llama_adapter or prefix.

Note if you are running on a machine with multiple GPUs please make sure to only make one of them visible using export CUDA_VISIBLE_DEVICES=GPU:id

Make sure you set save_model in training.py to save the model. Be sure to check the other training settings in train config as well as others in the config folder as needed or they can be passed as args to the training script as well.

Multiple GPUs One Node:

NOTE please make sure to use PyTorch Nightlies for using PEFT+FSDP. Also, note that int8 quantization from bit&bytes currently is not supported in FSDP.

torchrun --nnodes 1 --nproc_per_node 4  llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /patht_of_model_folder/7B --pure_bf16 --output_dir Path/to/save/PEFT/model

Here we use FSDP as discussed in the next section which can be used along with PEFT methods. To make use of PEFT methods with FSDP make sure to pass use_peft and peft_method args along with enable_fsdp. Here we are using BF16 for training.

Flash Attention and Xformer Memory Efficient Kernels

Setting use_fast_kernels will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up the fine-tuning job. This has been enabled in optimum library from HuggingFace as a one-liner API, please read more here.

torchrun --nnodes 1 --nproc_per_node 4  llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /patht_of_model_folder/7B --pure_bf16 --output_dir Path/to/save/PEFT/model --use_fast_kernels

Fine-tuning using FSDP Only

If you are interested in running full parameter fine-tuning without making use of PEFT methods, please use the following command. Make sure to change the nproc_per_node to your available GPUs. This has been tested with BF16 on 8xA100, 40GB GPUs.

torchrun --nnodes 1 --nproc_per_node 8  llama_finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --use_fast_kernels

Fine-tuning using FSDP on 70B Model

If you are interested in running full parameter fine-tuning on the 70B model, you can enable low_cpu_fsdp mode as the following command. This option will load model on rank0 only before moving model to devices to construct FSDP. This can dramatically save cpu memory when loading large models like 70B (on a 8-gpu node, this reduces cpu memory from 2+T to 280G for 70B model). This has been tested with BF16 on 16xA100, 80GB GPUs.

torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned

Multi GPU Multi Node:

sbatch multi_node.slurm
# Change the num nodes and GPU per nodes in the script before running.

You can read more about our fine-tuning strategies here.

Repository Organization

This repository is organized in the following way:

configs: Contains the configuration files for PEFT methods, FSDP, Datasets.

docs: Example recipes for single and multi-gpu fine-tuning recipes.

ft_datasets: Contains individual scripts for each dataset to download and process. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)

inference: Includes examples for inference for the fine-tuned models and how to use them safely.

model_checkpointing: Contains FSDP checkpoint handlers.

policies: Contains FSDP scripts to provide different policies, such as mixed precision, transformer wrapping policy and activation checkpointing along with any precision optimizer (used for running FSDP with pure bf16 mode).

utils: Utility files for:

train_utils.py provides training/eval loop and more train utils.
dataset_utils.py to get preprocessed datasets.
config_utils.py to override the configs received from CLI.
fsdp_utils.py provides FSDP wrapping policy for PEFT methods.
memory_utils.py context manager to track different memory stats in train loop.

License

See the License file here and Acceptable Use Policy here

speediedan/llama-recipes