Hugging Face Llama Recipes

🤗🦙Welcome! This repository contains minimal recipes to get started quickly with Llama 3.x models, including Llama 3.1, Llama 3.2, and Llama 3.3.

This repository is a work in progress, so you may see considerable changes in the coming days.

Note

To use Llama 3.x, you need to accept the license and request permission to access the models. Please visit the Hugging Face repos and submit your request. You only need to do this once per collection; you'll get access to all the repos in the collection if your request is approved.

Getting Started

The easiest way to quickly run a Llama 🦙 on your machine is with the 🤗 transformers library. Make sure you have the latest release installed.

$ pip install -U transformers

Let's chat with an instruction-tuned model.

import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

llama_31 = "meta-llama/Llama-3.1-8B-Instruct" # <-- llama 3.1
llama_32 = "meta-llama/Llama-3.2-3B-Instruct" # <-- llama 3.2

prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

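# Load the model into a pipeline, with the weights in bfloat16.
generator = pipeline(model=llama_32, device=device, torch_dtype=torch.bfloat16)
generation = generator(
    prompt,
    do_sample=False,  # greedy decoding
    temperature=1.0,  # neutral value; not used when sampling is off
    top_p=1,          # neutral value; not used when sampling is off
    max_new_tokens=50
)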

print(f"Generation: {generation[0]['generated_text']}")

# Generation:
# [
#   {'role': 'system', 'content': 'You are a helpful assistant, that responds as a pirate.'},
#   {'role': 'user', 'content': "What's Deep Learning?"},
#   {'role': 'assistant', 'content': "Yer lookin' fer a treasure trove o'
#             knowledge on Deep Learnin', eh? Alright then, listen close and
#             I'll tell ye about it.\n\nDeep Learnin' be a type o' machine
#             learnin' that uses neural networks"}
# ]

Local Inference

Would you like to run inference with the Llama models locally? So do we! The memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations:

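| Model Size | Llama Variant | BF16/FP16 | FP8 | INT4 (AWQ/GPTQ/bnb) |
|---|---|---|---|---|
| 1B | 3.2 | 2.5 GB | 1.25 GB | 0.75 GB |
| 3B | 3.2 | 6.5 GB | 3.2 GB | 1.75 GB |
| 8B | 3.1 | 16 GB | 8 GB | 4 GB |
| 70B | 3.1 and 3.3 | 140 GB | 70 GB | 35 GB |
| 405B | 3.1 | 810 GB | 405 GB | 204 GB |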

Note

These are estimated values and may vary based on specific implementation details and optimizations.
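The figures in the table follow roughly from the parameter count multiplied by the bytes per parameter. Here is a minimal sketch of that arithmetic, assuming weights only (no KV cache, activations, or quantization metadata) and 1 GB = 10^9 bytes. The function name `estimate_weight_memory_gb` is hypothetical and not part of any library.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_param


# Nominal sizes; the actual parameter counts differ slightly from these labels.
for name, size in [("8B", 8), ("70B", 70), ("405B", 405)]:
    bf16 = estimate_weight_memory_gb(size, 16)
    fp8 = estimate_weight_memory_gb(size, 8)
    int4 = estimate_weight_memory_gb(size, 4)
    print(f"{name}: BF16 ~{bf16:g} GB, FP8 ~{fp8:g} GB, INT4 ~{int4:g} GB")
```

The estimate for 405B in INT4 comes out slightly below the table's figure, which is plausible because quantized checkpoints store extra data alongside the 4-bit weights.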

Working with the capable Llama 3.1 8B models:

Working on the 🐘 big Llama 3.1 405B model:

Model Fine-Tuning

It is often not enough to run inference on the model. Many times, you need to fine-tune the model on some custom dataset. Here are some scripts showing how to fine-tune the models.

Fine-tune models on your custom dataset:

Assisted Decoding Techniques

Do you want to use the smaller Llama 3.2 models to speed up text generation for bigger models? These notebooks showcase assisted decoding (speculative decoding), which gives you up to 2x speedups for text generation on Llama 3.1 70B (with greedy decoding).
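To build some intuition for why assisted decoding leaves greedy output unchanged, here is a toy sketch in plain Python. It is not the transformers implementation (there, the feature is exposed through the `assistant_model` argument of `generate`). The "models" below are hypothetical stand-in functions that map a token sequence to the next token, and the target is queried in a loop, whereas a real implementation would verify all draft tokens in one forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # token sequence -> greedy next token


def assisted_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> List[int]:
    """Toy assisted decoding; the result matches plain greedy decoding with `target`."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes a few tokens autoregressively.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks every proposed position.
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # First mismatch: keep the agreed prefix plus the target's own token.
                tokens.extend(proposal[:i] + [expected])
                break
        else:
            # No mismatch: every draft token is accepted.
            tokens.extend(proposal)
        tokens = tokens[:limit]
    return tokens


def target_model(seq: List[int]) -> int:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


result = assisted_greedy_decode(target_model, draft_model, prompt=[0], max_new_tokens=12)
print(result)
```

The speedup in practice comes from step 2: whenever the draft model guesses well, the large model confirms several tokens for the cost of roughly one forward pass.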

Performance Optimization

Let us optimize performance, shall we?

API inference

Are these models too large for you to run at home? Would you like to experiment with Llama 70B? Try out the following examples!

Llama Guard and Prompt Guard

In addition to the generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects jailbreaks and prompt injections. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations. The following notebooks show how to use them:

Synthetic Data Generation

With ever-hungry models, the need for synthetic data generation is on the rise. Here we show you how to build your very own synthetic dataset.

Llama RAG

Seeking an entry-level RAG pipeline? This notebook guides you through building a simple, streamlined RAG experiment using Llama and Hugging Face.
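At its core, a RAG pipeline retrieves the passages most relevant to a question and places them in the prompt. The sketch below illustrates that flow with the standard library only. It assumes a bag-of-words cosine similarity as a stand-in for a real embedding model, and the function names and the three sample documents are hypothetical. The resulting message list has the same shape as the `prompt` in Getting Started, so it could be passed to a chat pipeline.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(q, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_messages(query: str, documents: List[str], top_k: int = 1) -> List[Dict[str, str]]:
    """Put the retrieved context into the system message of a chat prompt."""
    context = "\n".join(retrieve(query, documents, top_k))
    return [
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ]


docs = [
    "Llama 3.2 includes 1B and 3B models.",
    "Text Generation Inference enables scalable deployment of Llama models.",
    "Prompt Guard is a small classifier that detects jailbreaks and prompt injections.",
]
messages = build_rag_messages("What does Prompt Guard detect?", docs)
print(messages[0]["content"])
```

A real pipeline would typically swap `bag_of_words` for an embedding model and the sorted list for a vector index, but the retrieve-then-prompt structure stays the same.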

Text Generation Inference (TGI) & API Inference with Llama Models

The Text Generation Inference (TGI) framework enables efficient and scalable deployment of Llama models. In this notebook, we'll learn how to integrate TGI for fast text generation and how to consume already deployed Llama models via the Inference API.
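As a rough illustration of what consuming a TGI server looks like, the sketch below sends a request to TGI's `/generate` endpoint using only the standard library. The base URL `http://localhost:8080` is an assumption (a server you have already started locally), and the helper names are hypothetical. Note that `/generate` takes raw text, so no chat template is applied to the prompt here.

```python
import json
import urllib.request


def build_generate_payload(prompt: str, max_new_tokens: int = 50) -> dict:
    """Request body for TGI's /generate endpoint."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def generate(prompt: str, base_url: str = "http://localhost:8080", max_new_tokens: int = 50) -> str:
    """POST the prompt to a running TGI server and return the generated text."""
    body = json.dumps(build_generate_payload(prompt, max_new_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]


# Requires a running server, so the call is left commented out:
# print(generate("What's Deep Learning?"))
```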

Chatbot Demo with Llama Models

Would you like to build a chatbot with Llama models? Here's a simple example to get you started.

Tool Calling

In this notebook, we explore how to leverage the tool-calling capabilities of Llama models and how to use the chat_template functionality for tool interactions.
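On the application side, tool calling comes down to parsing the call the model emits, running the matching function, and appending the result to the conversation. The sketch below covers only that parse-and-dispatch step. It assumes the model emits a JSON object with `name` and `parameters` keys. The tool `get_current_temperature`, its canned return value, and the helper `run_tool_call` are hypothetical. In transformers, the tool descriptions themselves are passed to the model through the `tools` argument of `tokenizer.apply_chat_template`.

```python
import json
from typing import Callable, Dict, List


def get_current_temperature(location: str) -> float:
    """Hypothetical tool: returns a canned value instead of calling a weather API."""
    return 22.0


# Registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable] = {"get_current_temperature": get_current_temperature}


def run_tool_call(model_output: str, messages: List[dict]) -> List[dict]:
    """Parse a JSON tool call, run the tool, and return the extended conversation."""
    call = json.loads(model_output)
    name, arguments = call["name"], call["parameters"]
    result = TOOLS[name](**arguments)
    return messages + [
        # Record the call the assistant made ...
        {
            "role": "assistant",
            "tool_calls": [
                {"type": "function", "function": {"name": name, "arguments": arguments}}
            ],
        },
        # ... followed by the tool's result, so the model can use it in its next turn.
        {"role": "tool", "name": name, "content": str(result)},
    ]


messages = [{"role": "user", "content": "What's the temperature in Paris?"}]
# Assumed shape of the model's output; a real run would obtain this from generation.
model_output = '{"name": "get_current_temperature", "parameters": {"location": "Paris, France"}}'
messages = run_tool_call(model_output, messages)
print(messages[-1])
```

After this step, the extended `messages` list would be rendered through the chat template again so the model can answer using the tool result.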