NEW! We're in 🤗 Huggingface's official docs! We're on the SFT docs and the DPO docs!
Supports Llama, Yi, Mistral, CodeLlama, Qwen (llamafied), Deepseek and their derived models (Open Hermes etc).
All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware needed. Supports NVIDIA GPUs from 2018 onwards. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40 etc). Check your GPU with the snippet after this list! GTX 1070 and 1080 work, but are slow.
Works on Linux and Windows via WSL.
NEW! Download 4-bit models 4x faster from 🤗 Huggingface! E.g. unsloth/mistral-7b-bnb-4bit
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
NEW! Want a UI for finetuning? Try Llama-Factory and use --use_unsloth!
Open source trains 5x faster - see Unsloth Pro for 30x faster training!
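To check whether your GPU qualifies, here is a minimal sketch using PyTorch's standard device-query API (nothing Unsloth-specific is assumed):

```python
import torch

# (major, minor) CUDA Compute Capability, e.g. (7, 5) for a Tesla T4.
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: CUDA Capability {major}.{minor}")
print("OK!" if (major, minor) >= (7, 0) else "Below 7.0 - expect slow or unsupported behaviour.")
```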
Do NOT use this pip install method if you have Anaconda. You must use the Conda install method instead, or else things will BREAK.
Find your CUDA version via:

```python
import torch; torch.version.cuda
```
For Pytorch 2.1.0: you can update Pytorch via pip (interchange cu121 / cu118). Go to https://pytorch.org/ to learn more. Select cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100 etc), use the "ampere" path. For Pytorch 2.1.1: go to step 3.
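For example, using pytorch.org's standard index-URL pattern (illustrative only - pick the wheel matching your CUDA version):

```bash
# CUDA 12.1 wheels; swap cu121 for cu118 if you are on CUDA 11.8.
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu121
```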
If you get errors, try the below first, then go back to step 1:

```bash
pip install --upgrade pip
```
Documentation
We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
```python
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support - 4x faster downloading!
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
]

# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
```
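After training, the patched model still behaves like a regular Huggingface model, so standard generation works. A minimal sketch (the prompt string is purely illustrative):

```python
# Standard transformers generation - nothing Unsloth-specific assumed.
inputs = tokenizer("The LAION dataset is", return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 32)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
```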
DPO (Direct Preference Optimization) Support
DPO, PPO and Reward Modelling all seem to work, as per third-party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: notebook.
```python
from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # Choose any; RoPE Scaling is supported.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
```
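`YOUR_DATASET_HERE` must follow TRL's DPO format: one `prompt`, one `chosen` and one `rejected` text column per row. A toy sketch (the example strings are ours, purely illustrative):

```python
from datasets import Dataset

# Minimal DPO-style preference dataset: a prompt plus a preferred
# ("chosen") and a dispreferred ("rejected") completion.
toy_dataset = Dataset.from_dict({
    "prompt":   ["What is 2 + 2?"],
    "chosen":   ["2 + 2 equals 4."],
    "rejected": ["2 + 2 equals 5."],
})
# Then pass train_dataset = toy_dataset to DPOTrainer above.
```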
Support us!
We're currently 2 brothers trying to make LLMs for everyone! It'll be super cool if you can support our work!!
Future Milestones and limitations
Support Mixtral.
Supports all Mistral and Llama type models, but some are unoptimized (e.g. Qwen with biases).
Dropout and bias in LoRA matrices are supported, just not optimized - see the sketch after this list.
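For example, a sketch reusing the `get_peft_model` call from above with the unoptimized settings swapped in (slower, but supported):

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0.1, # Any value is supported; 0 is the optimized fast path.
    bias = "all",       # Any value is supported; "none" is the optimized fast path.
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = 2048,
)
```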
Performance comparisons on 1 Tesla T4 GPU:

Time taken for 1 epoch

One Tesla T4 on Google Colab
`bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10`

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |

Peak Memory Usage

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
Performance comparisons on 2 Tesla T4 GPUs via DDP:

Time taken for 1 epoch

Two Tesla T4s on Kaggle
`bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10`

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |

Peak Memory Usage on a Multi GPU System (2 GPUs)

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 2 T4 | 8.4GB \| 6GB | 7.2GB \| 5.3GB | 14.3GB \| 6.6GB | 10.9GB \| 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB \| 4.9GB | 7.5GB \| 4.9GB | 8.5GB \| 4.9GB | 6.2GB \| 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB \| 5GB | 10.6GB \| 5GB | 10.6GB \| 5GB | 10.5GB \| 5GB * |

\* Slim Orca uses `bsz=1` for all benchmarks since `bsz=2` OOMs. We can handle `bsz=2`, but we benchmark it with `bsz=1` for consistency.
Llama-Factory 3rd party benchmarking

| Method | Bits | TGS | GRAM | Speed |
| --- | --- | --- | --- | --- |
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | 160% |

Link to performance table. TGS: tokens per GPU per second. GRAM: peak GPU memory used. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
How did we make it faster?
Manual autograd, Triton kernels etc. See our Benchmark Breakdown for more info!
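For a flavour of what a Triton kernel looks like, here is a minimal illustrative kernel (NOT one of Unsloth's actual kernels): it fuses a scale and an add into a single GPU pass instead of launching two separate PyTorch ops.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scaled_add_kernel(x_ptr, y_ptr, out_ptr, alpha, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis = 0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements          # Guard the final partial block.
    x = tl.load(x_ptr + offsets, mask = mask)
    y = tl.load(y_ptr + offsets, mask = mask)
    # One fused read-modify-write instead of separate mul and add kernels.
    tl.store(out_ptr + offsets, x * alpha + y, mask = mask)

x = torch.randn(4096, device = "cuda")
y = torch.randn(4096, device = "cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
scaled_add_kernel[grid](x, y, out, 2.0, x.numel(), BLOCK = 1024)
assert torch.allclose(out, x * 2.0 + y)
```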
Troubleshooting
Sometimes bitsandbytes or xformers does not link properly. Try running:

```bash
!ldconfig /usr/lib64-nvidia
```
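If that does not fix it, both packages ship their own diagnostics, which usually pinpoint the broken link:

```bash
python -m xformers.info
python -m bitsandbytes
```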
Windows is not supported natively as of yet (use WSL, per above) - we rely on xformers and Triton, so once both packages officially support Windows, Unsloth will too.
If it doesn't install - maybe try updating pip.
Full benchmarking tables
Click "Code" for a fully reproducible example.
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.