torchtune

Note

July 2024: torchtune has updated model weights for Llama3.1 in source and nightly builds! Check out our configs for both the 8B and 70B versions of the model. LoRA, QLoRA, and full finetune methods are supported. Support for QLoRA 405B will be added soon.

Introduction

torchtune is a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs. We're excited to announce our alpha release!

torchtune provides:

Native-PyTorch implementations of popular LLMs using composable and modular building blocks
Easy-to-use and hackable training recipes for popular fine-tuning techniques (LoRA, QLoRA) - no trainers, no frameworks, just PyTorch!
YAML configs for easily configuring training, evaluation, quantization or inference recipes
Built-in support for many popular dataset formats and prompt templates to help you quickly get started with training

torchtune focuses on integrating with popular tools and libraries from the ecosystem. These are just a few examples, with more under development:

Hugging Face Hub for accessing model weights
EleutherAI's LM Eval Harness for evaluating trained models
Hugging Face Datasets for access to training and evaluation datasets
PyTorch FSDP for distributed training
torchao for lower precision dtypes and post-training quantization techniques
Weights & Biases for logging metrics and checkpoints, and tracking training progress
Comet as another option for logging
ExecuTorch for on-device inference using fine-tuned models
bitsandbytes for low memory optimizers for our single-device recipes

Models

torchtune currently supports the following models.

Model	Sizes
Llama3.1	8B, 70B [models, configs]
Llama3	8B, 70B [models, configs]
Llama2	7B, 13B, 70B [models, configs]
Code-Llama2	7B, 13B, 70B [model, configs]
Mistral	7B [model, configs]
Gemma	2B, 7B [model, configs]
Microsoft Phi3	Mini [model, configs]
Qwen2	0.5B, 1.5B, 7B [model, configs]

We're always adding new models, but feel free to file an Issue if there's a new one you would love to see in torchtune!

Fine-tuning recipes

torchtune provides the following fine-tuning recipes.

Training	Fine-tuning Method
Distributed Training [1 to 8 GPUs]	Full [code, example], LoRA [code, example]
Single Device / Low Memory [1 GPU]	Full [code, example], LoRA + QLoRA [code, example]
Single Device [1 GPU]	DPO [code, example], RLHF with PPO [code, example]

Memory efficiency is important to us. All of our recipes are tested on a variety of setups including commodity GPUs with 24GB of VRAM as well as beefier options found in data centers.

Single-GPU recipes expose a number of memory optimizations that aren't available in the distributed versions. These include support for low-precision optimizers from bitsandbytes and fusing optimizer step with backward to reduce memory footprint from the gradients (see example config). For memory-constrained setups, we recommend using the single-device configs as a starting point.

This table captures the peak memory usage and training speed for recipes in torchtune.

Example HW Resources	Finetuning Method	Model	Setting	Peak Memory per GPU (GB)	Training Speed (tokens/sec)
1 x RTX 4090	QLoRA **	Llama2-7B	Batch Size = 4, Seq Length = 2048	12.3 GB	3155
1 x RTX 4090	LoRA	Llama2-7B	Batch Size = 4, Seq Length = 2048	21.3 GB	2582
2 x RTX 4090	LoRA	Llama2-7B	Batch Size = 4, Seq Length = 2048	16.2 GB	2768
1 x RTX 4090	Full finetune *	Llama2-7B	Batch Size = 4, Seq Length = 2048	24.1 GB	702
4 x RTX 4090	Full finetune	Llama2-7B	Batch Size = 4, Seq Length = 2048	24.1 GB	1388
8 x A100	LoRA	Llama2-70B	Batch Size = 4, Seq Length = 4096	26.4 GB	3384
8 x A100	Full Finetune *	Llama2-70B	Batch Size = 4, Seq Length = 4096	70.4 GB	2032

* = Uses PagedAdamW from bitsandbytes

** = Uses torch.compile
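If you want to compare rows of the table above, the arithmetic is straightforward. The short sketch below copies three single-GPU rows from the table and computes the relative memory saving and throughput ratio of QLoRA against LoRA. The helper names `memory_saving` and `speedup` are ours, and the comparison is only indicative, since the QLoRA row uses torch compile while the LoRA row does not.

```python
# Figures copied from the table above: (peak memory per GPU in GB, tokens/sec).
runs = {
    "QLoRA, 1 x RTX 4090": (12.3, 3155),
    "LoRA, 1 x RTX 4090": (21.3, 2582),
    "Full finetune, 1 x RTX 4090": (24.1, 702),
}


def memory_saving(baseline_gb: float, candidate_gb: float) -> float:
    """Fraction of peak memory saved relative to the baseline."""
    return (baseline_gb - candidate_gb) / baseline_gb


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Throughput of the candidate as a multiple of the baseline."""
    return candidate_tps / baseline_tps


lora_mem, lora_tps = runs["LoRA, 1 x RTX 4090"]
qlora_mem, qlora_tps = runs["QLoRA, 1 x RTX 4090"]

print(
    f"QLoRA vs LoRA: {memory_saving(lora_mem, qlora_mem):.1%} less peak memory, "
    f"{speedup(lora_tps, qlora_tps):.2f}x tokens/sec"
)
```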

Llama3 and Llama3.1

torchtune supports fine-tuning for the Llama3 8B and 70B size models. We currently support LoRA, QLoRA and full fine-tune on a single GPU as well as LoRA and full fine-tune on multiple devices for the 8B model, and LoRA on multiple devices for the 70B model. For all the details, take a look at our tutorial.

Note

Our Llama3 and Llama3.1 LoRA and QLoRA configs default to the instruct fine-tuned models. This is because not all special token embeddings are initialized in the base 8B and 70B models.

In our initial experiments for Llama3-8B, QLoRA has a peak allocated memory of ~9GB while LoRA on a single GPU has a peak allocated memory of ~19GB. To get started, you can use our default configs to kick off training.

Single GPU

LoRA 8B

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device

QLoRA 8B

tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device

Full 8B

tune run full_finetune_single_device --config llama3_1/8B_full_single_device

Multi GPU

Full 8B

tune run --nproc_per_node 4 full_finetune_distributed --config llama3_1/8B_full

LoRA 8B

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_1/8B_lora

LoRA 70B

Note that the download command for the Llama3.1 70B model differs slightly from the download commands for the 8B models, because we load the 70B model from the Hugging Face safetensors format. To download the 70B model, run:

tune download meta-llama/Meta-Llama-3.1-70b --hf-token <> --output-dir /tmp/Meta-Llama-3.1-70b --ignore-patterns "original/consolidated*"

Then, a finetune can be kicked off:

tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_1/70B_lora.yaml

You can find a full list of all our Llama3 configs here and Llama3.1 configs here.

Installation

Step 1: Install PyTorch. torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. torchtune leverages torchvision for fine-tuning multimodal LLMs and torchao for the latest in quantization techniques, so you should install these as well.

# Install stable version of PyTorch libraries using pip
pip install torch torchvision torchao

# Nightly install for latest features
pip install --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu121

Step 2: The latest stable version of torchtune is hosted on PyPI and can be downloaded with the following command:

pip install torchtune

To confirm that the package is installed correctly, you can run the following command:

tune --help

You should see the following output:

usage: tune [-h] {ls,cp,download,run,validate} ...

Welcome to the torchtune CLI!

options:
  -h, --help            show this help message and exit

...
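If you would rather run this check from a script (for example in a CI job), the following sketch does something similar using only the Python standard library. The `check_install` helper is hypothetical and not part of torchtune; it only reports whether the `torchtune` package is importable and whether a `tune` executable is on your PATH.

```python
import importlib.util
import shutil


def check_install(module_name: str = "torchtune", command: str = "tune") -> dict:
    """Report whether a package is importable and its CLI entry point is on PATH."""
    return {
        "module": importlib.util.find_spec(module_name) is not None,
        "cli": shutil.which(command) is not None,
    }


print(check_install())
```

Both values should be True after a successful install. If "module" is True but "cli" is False, the package is installed but its scripts directory is probably not on your PATH.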

You can also install the latest and greatest torchtune has to offer by installing a nightly build.

Get Started

To get started with fine-tuning your first LLM with torchtune, see our tutorial on fine-tuning Llama2 7B. Our end-to-end workflow tutorial will show you how to evaluate, quantize and run inference with this model. The rest of this section gives a quick overview of these steps using Llama2 and Llama3 examples.

Downloading a model

Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

Llama3 download

tune download meta-llama/Meta-Llama-3-8B \
--output-dir /tmp/Meta-Llama-3-8B \
--hf-token <HF_TOKEN>

Tip

Set your environment variable HF_TOKEN or pass in --hf-token to the command in order to validate your access. You can find your token at https://huggingface.co/settings/tokens

Running fine-tuning recipes

Llama3 8B + LoRA on single GPU:

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

For distributed training, the tune CLI integrates with torchrun. Llama3 8B + LoRA on two GPUs:

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora

Tip

Make sure to place any torchrun arguments before the recipe name. Any CLI arguments after the recipe are treated as config overrides and do not affect distributed training.
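The ordering rule in this tip can be made explicit with a small helper that assembles the command for you. This is only a sketch: `build_tune_command` is a hypothetical function, not part of torchtune, and the `batch_size=8` override is just an example value. It places launcher (torchrun) arguments before the recipe name and config overrides after the `--config` option.

```python
import shlex


def build_tune_command(recipe, config, launcher_args=(), overrides=()):
    """Assemble a `tune run` command with arguments in the required order."""
    return [
        "tune", "run",
        *launcher_args,      # torchrun arguments: must come before the recipe
        recipe,
        "--config", config,
        *overrides,          # key=value pairs: these override the config only
    ]


cmd = build_tune_command(
    "lora_finetune_distributed",
    "llama3_1/8B_lora",
    launcher_args=["--nproc_per_node", "2"],
    overrides=["batch_size=8"],
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run` if you launch training from a Python script.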

Modify Configs

There are two ways in which you can modify configs:

Config Overrides

You can easily override config properties from the command line:

tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
batch_size=8 \
enable_activation_checkpointing=True \
max_steps_per_epoch=128
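Conceptually, each `key=value` argument replaces (or adds) one entry in the loaded config. The sketch below illustrates that merge with plain dictionaries. It is not torchtune's implementation: `parse_value` and `apply_overrides` are hypothetical helpers, the base config values are made up, and the handling of dotted keys such as `optimizer.lr` for nested fields is an assumption of this sketch.

```python
import ast
import copy


def parse_value(text):
    """Interpret '8' as int, 'True' as bool, and so on; fall back to the raw string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, overrides):
    """Return a copy of config with key=value overrides applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = merged
        for name in parents:  # walk (or create) nested sections
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return merged


# Made-up base config, standing in for a YAML file loaded from disk.
base = {
    "batch_size": 2,
    "enable_activation_checkpointing": False,
    "optimizer": {"lr": 3e-4},
}
merged = apply_overrides(
    base,
    [
        "batch_size=8",
        "enable_activation_checkpointing=True",
        "max_steps_per_epoch=128",
        "optimizer.lr=1e-4",
    ],
)
print(merged)
```

Note that the base config is left untouched, which mirrors the fact that command-line overrides do not modify the config file itself.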

Update a Local Copy

You can also copy the config to your local directory and modify the contents directly:

tune cp llama2/7B_full ./my_custom_config.yaml
Copied to ./my_custom_config.yaml

Then, you can run your custom recipe by directing the tune run command to your local files:

tune run full_finetune_distributed --config ./my_custom_config.yaml

Check out tune --help for all possible CLI commands and options. For more information on using and updating configs, take a look at our config deep-dive.

Design Principles

torchtune embodies PyTorch’s design philosophy [details], especially "usability over everything else".

Native PyTorch

torchtune is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g. Hugging Face Datasets, EleutherAI Eval Harness), all of the core functionality is written in PyTorch.

Simplicity and Extensibility

torchtune is designed to be easy to understand, use and extend.

Composition over implementation inheritance - layers of inheritance for code re-use make the code hard to read and extend
No training frameworks - explicitly outlining the training logic makes it easy to extend for custom use cases
Code duplication is preferred over unnecessary abstractions
Modular building blocks over monolithic components

Correctness

torchtune provides well-tested components with a high bar on correctness. The library will never be the first to provide a feature, but available features will be thoroughly tested. We provide:

Extensive unit-tests to ensure component-level numerical parity with reference implementations
Checkpoint-tests to ensure model-level numerical parity with reference implementations
Integration tests to ensure recipe-level performance parity with reference implementations on standard benchmarks

Community Contributions

We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions! If you'd like to help out as well, please see the CONTRIBUTING guide.

@SalmanMohammadi for adding a comprehensive end-to-end recipe for Reinforcement Learning from Human Feedback (RLHF) finetuning with PPO to torchtune
@fyabc for adding Qwen2 models, tokenizer, and recipe integration to torchtune
@solitude-alive for adding the Gemma 2B model to torchtune, including recipe changes, numeric validations of the models and recipe correctness
@yechenzhi for adding Direct Preference Optimization (DPO) to torchtune, including the recipe and config along with correctness checks

Acknowledgements

The Llama2 code in this repository is inspired by the original Llama2 code.

We want to give a huge shout-out to EleutherAI, Hugging Face and Weights & Biases for being wonderful collaborators and for working with us on some of these integrations within torchtune.

We also want to acknowledge some awesome libraries and tools from the ecosystem:

gpt-fast for performant LLM inference techniques which we've adopted OOTB
llama recipes for spring-boarding the llama2 community
bitsandbytes for bringing several memory and performance based techniques to the PyTorch ecosystem
@winglian and axolotl for early feedback and brainstorming on torchtune's design and feature set.
lit-gpt for pushing the LLM fine-tuning community forward.
HF TRL for making reward modeling more accessible to the PyTorch community.

License

torchtune is released under the BSD 3 license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.
