Finetune Llama 3, Mistral & Gemma LLMs 2-5x faster with 80% less memory
Language: Python · License: Apache-2.0
✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
📣 NEW! We cut memory usage by a further 30% and now support fine-tuning of LLMs with 4x longer context windows! No change is required if you're using our notebooks. To enable it in your own code, change one line: set use_gradient_checkpointing="unsloth" in FastLanguageModel.get_peft_model.
All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware. Supports NVIDIA GPUs from 2018 onward, with a minimum CUDA Capability of 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc). GTX 1070 and 1080 work, but are slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
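The hardware requirement above (CUDA Capability 7.0 or higher) can be checked programmatically. The following is a minimal sketch, assuming PyTorch is installed for the live query. The helper names `meets_minimum_capability` and `current_gpu_capability` are hypothetical and not part of Unsloth.

```python
from typing import Optional, Tuple

# Minimum CUDA Capability stated in the feature list above (V100, T4, ...).
MIN_CAPABILITY = (7, 0)


def meets_minimum_capability(major: int, minor: int) -> bool:
    """Return True if (major, minor) is at least CUDA Capability 7.0."""
    return (major, minor) >= MIN_CAPABILITY


def current_gpu_capability() -> Optional[Tuple[int, int]]:
    """Query the first visible GPU; return None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # Optional dependency, only needed for the live query.
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = current_gpu_capability()
    if capability is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    else:
        print(capability, "meets minimum:", meets_minimum_capability(*capability))
```

A GTX 1070 or 1080 reports capability 6.1, so the check returns False for them, which matches the "works, but is slow" caveat above.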
If you trained a model with 🦥Unsloth, you can use this cool sticker!
🥇 Performance Benchmarking
For the full list of reproducible benchmarking tables, go to our website.
The benchmark in the table below was conducted by 🤗Hugging Face.
| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |
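A speedup factor is easy to misread as a percentage saved. As a rough illustration, the sketch below converts the Unsloth speedups from the table into the share of wall-clock time saved relative to the 1x baseline. The figures are copied from the table; `time_saved_percent` is a hypothetical helper name.

```python
# Rows copied from the benchmark table above:
# (model, Unsloth speedup vs. the 1x baseline, VRAM change in percent)
BENCHMARKS = [
    ("Llama-2 7b", 1.95, -43.3),
    ("Mistral 7b", 1.56, -13.7),
    ("Tiny Llama 1.1b", 3.87, -73.8),
    ("DPO with Zephyr", 1.55, -18.6),
]


def time_saved_percent(speedup: float) -> float:
    """Convert a speedup factor into the percentage of training time saved."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    for model, speedup, vram_change in BENCHMARKS:
        print(f"{model}: {time_saved_percent(speedup):.1f}% less time, "
              f"{vram_change:+.1f}% VRAM")
```

For example, a 1.95x speedup means a run takes 1/1.95 of the baseline time, so roughly 48.7% of the time is saved, not 95%.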
💾 Installation Instructions
Conda Installation
Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. If you have mamba, use mamba instead of conda for faster solving. See this GitHub issue for help on debugging Conda installs.
Pip Installation
Do NOT use the Pip install method if you have Anaconda. You must use the Conda install method instead, or else stuff will BREAK.
Find your CUDA version via
import torch; torch.version.cuda
For Pytorch 2.1.0: You can update Pytorch via Pip (interchange cu121 / cu118). Go to https://pytorch.org/ to learn more. Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100 etc), use the "ampere" path. For Pytorch 2.1.1: go to step 3. For Pytorch 2.2.0: go to step 4.
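The selection rules above (cu118 vs. cu121, and whether to take the "ampere" path) can be sketched as a small lookup. This is an illustration only: `pick_install_options` is a hypothetical helper, and treating "RTX 3060 or higher" as CUDA Capability 8.0 or above is an assumption based on Ampere GPUs reporting capability 8.x. It does not produce the actual install command.

```python
# Mapping from the CUDA version string (as returned by torch.version.cuda)
# to the wheel tag named in the instructions above.
CUDA_TAGS = {"11.8": "cu118", "12.1": "cu121"}


def pick_install_options(cuda_version: str, capability_major: int) -> dict:
    """Return the CUDA tag and whether the "ampere" path applies.

    Assumption: "RTX 3060 or higher (A100, H100 etc)" corresponds to a
    CUDA Capability major version of 8 or above.
    """
    if cuda_version not in CUDA_TAGS:
        raise ValueError(f"Unsupported CUDA version: {cuda_version!r}")
    return {
        "cuda_tag": CUDA_TAGS[cuda_version],
        "use_ampere_path": capability_major >= 8,
    }


if __name__ == "__main__":
    # Example: an A100 (capability 8.0) with CUDA 12.1.
    print(pick_install_options("12.1", 8))
    # Example: a T4 (capability 7.5) with CUDA 11.8.
    print(pick_install_options("11.8", 7))
```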
Go to our Wiki page for saving to GGUF, checkpointing, evaluation and more!
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048  # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files={"train": url}, split="train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",  # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",  # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit",  # [NEW] 15 Trillion token Llama-3
    "unsloth/Phi-3-mini-4k-instruct-bnb-4bit",
]  # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    max_seq_length=max_seq_length,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=60,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_8bit",
        seed=3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
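To get a feel for what the settings in the example above imply, the sketch below works through two pieces of arithmetic: the number of trainable LoRA parameters for r=16 on the seven target modules, and the number of sequences consumed by the 60-step run. The layer dimensions are assumptions taken from the publicly documented Llama-3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128). They are not stated in this README, and the helper names are hypothetical.

```python
# Assumed Llama-3 8B dimensions (not given in this README).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads of dimension 128
NUM_LAYERS = 32

# (input features, output features) for each LoRA target module.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def sequences_seen(batch_size: int, grad_accum: int, steps: int) -> int:
    """Sequences processed on one GPU over the whole run."""
    return batch_size * grad_accum * steps


if __name__ == "__main__":
    print(f"{lora_trainable_params(16):,}")  # 41,943,040
    print(sequences_seen(2, 4, 60))          # 480
```

Under these assumptions, r=16 trains about 42 million parameters, and the example run touches 480 sequences (an effective batch of 8 per optimizer step).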
DPO Support
DPO (Direct Preference Optimization), PPO and Reward Modelling all seem to work, as per independent third-party testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4.
from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()

import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048  # Not defined in the original snippet; same value as the SFT example

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/zephyr-sft-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    max_seq_length=max_seq_length,
)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=3,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        seed=42,
        output_dir="outputs",
    ),
    beta=0.1,
    train_dataset=YOUR_DATASET_HERE,
    # eval_dataset=YOUR_DATASET_HERE,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
dpo_trainer.train()
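The YOUR_DATASET_HERE placeholder above needs preference data. TRL's DPOTrainer conventionally expects each record to carry "prompt", "chosen" and "rejected" text fields. The sketch below shows that shape and a small sanity check to run before training. The sample rows are invented for illustration, and `validate_preference_rows` is a hypothetical helper, not part of Unsloth or TRL.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")

# Invented sample rows, for illustration only.
SAMPLE_ROWS: List[Dict[str, str]] = [
    {
        "prompt": "Explain what LoRA is in one sentence.",
        "chosen": "LoRA fine-tunes a model by training small low-rank adapter matrices.",
        "rejected": "LoRA is a kind of radio.",
    },
]


def validate_preference_rows(rows: List[Dict[str, str]]) -> int:
    """Check that every row has non-empty prompt/chosen/rejected strings.

    Returns the number of valid rows, or raises ValueError on the first problem.
    """
    for index, row in enumerate(rows):
        missing = [name for name in REQUIRED_COLUMNS if name not in row]
        if missing:
            raise ValueError(f"row {index} is missing columns: {missing}")
        for name in REQUIRED_COLUMNS:
            if not isinstance(row[name], str) or not row[name]:
                raise ValueError(f"row {index} has an empty or non-string {name!r}")
        if row["chosen"] == row["rejected"]:
            raise ValueError(f"row {index} has identical chosen and rejected text")
    return len(rows)


if __name__ == "__main__":
    print(validate_preference_rows(SAMPLE_ROWS), "valid preference rows")
```

Rows that pass the check can then be wrapped in a Hugging Face dataset, for example with datasets.Dataset.from_list(rows), and passed as train_dataset.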
🥇 Detailed Benchmarking Tables
Click "Code" for fully reproducible examples
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remain identical.