Running Llama 2 with gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac).
- Supporting all Llama 2 models (7B, 13B, 70B, GPTQ, GGML, GGUF, CodeLlama) with 8-bit, 4-bit mode.
- Use llama2-wrapper as your local llama2 backend for Generative Agents/Apps; colab example.
- Run OpenAI Compatible API on Llama2 models.
- Supporting models: Llama-2-7b/13b/70b, Llama-2-GPTQ, Llama-2-GGML, Llama-2-GGUF, CodeLlama ...
- Supporting model backends: transformers, bitsandbytes(8-bit inference), AutoGPTQ(4-bit inference), llama.cpp
- Demos: Run Llama2 on MacBook Air; Run Llama2 on free Colab T4 GPU
- Use llama2-wrapper as your local llama2 backend for Generative Agents/Apps; colab example.
- Run OpenAI Compatible API on Llama2 models.
- News, Benchmark, Issue Solutions
Method 1: From PyPI
pip install llama2-wrapper
The newest llama2-wrapper>=0.1.14
supports llama.cpp's gguf
models.
If you would like to use old ggml
models, install llama2-wrapper<=0.1.13
or manually install llama-cpp-python==0.1.77
.
git clone https://github.com/liltom-eth/llama2-webui.git
cd llama2-webui
pip install -r requirements.txt
bitsandbytes >= 0.39
may not work on older NVIDIA GPUs. In that case, to use LOAD_IN_8BIT
, you may have to downgrade like this:
pip install bitsandbytes==0.38.1
bitsandbytes
also need a special install for Windows:
pip uninstall bitsandbytes
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.0-py3-none-win_amd64.whl
Run chatbot simply with web UI:
python app.py
app.py
will load the default config .env
which uses llama.cpp
as the backend to run llama-2-7b-chat.ggmlv3.q4_0.bin
model for inference. The model llama-2-7b-chat.ggmlv3.q4_0.bin
will be automatically downloaded.
Running on backend llama.cpp.
Use default model path: ./models/llama-2-7b-chat.Q4_0.gguf
Start downloading model to: ./models/llama-2-7b-chat.Q4_0.gguf
You can also customize your MODEL_PATH
, BACKEND_TYPE,
and model configs in .env
file to run different llama2 models on different backends (llama.cpp, transformers, gptq). You can use the --listen
to allow network requests, use the --port
to specify listening port.
We provide a code completion / filling UI for Code Llama.
Base model Code Llama and extend model Code Llama — Python are not fine-tuned to follow instructions. They should be prompted so that the expected answer is the natural continuation of the prompt. That means these two models focus on code filling and code completion.
Here is an example run CodeLlama code completion on llama.cpp backend:
python code_completion.py --model_path ./models/codellama-7b.Q4_0.gguf
codellama-7b.Q4_0.gguf
can be downloaded from TheBloke/CodeLlama-7B-GGUF.
Code Llama — Instruct trained with “natural language instruction” inputs paired with anticipated outputs. This strategic methodology enhances the model’s capacity to grasp human expectations in prompts. That means instruct models can be used in a chatbot-like app.
Example run CodeLlama chat on gptq backend:
python app.py --backend_type gptq --model_path ./models/CodeLlama-7B-Instruct-GPTQ/ --share True
CodeLlama-7B-Instruct-GPTQ
can be downloaded from TheBloke/CodeLlama-7B-Instruct-GPTQ
🔥 For developers, we released llama2-wrapper
as a llama2 backend wrapper in PYPI.
Use llama2-wrapper
as your local llama2 backend to answer questions and more, colab example:
# pip install llama2-wrapper
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt
llama2_wrapper = LLAMA2_WRAPPER()
# Default running on backend llama.cpp.
# Automatically downloading model to: ./models/llama-2-7b-chat.ggmlv3.q4_0.bin
prompt = "Do you know Pytorch"
answer = llama2_wrapper(get_prompt(prompt), temperature=0.9)
Run gptq llama2 model on Nvidia GPU, colab example:
from llama2_wrapper import LLAMA2_WRAPPER
llama2_wrapper = LLAMA2_WRAPPER(backend_type="gptq")
# Automatically downloading model to: ./models/Llama-2-7b-Chat-GPTQ
Run llama2 7b with bitsandbytes 8 bit with a model_path
:
from llama2_wrapper import LLAMA2_WRAPPER
llama2_wrapper = LLAMA2_WRAPPER(
model_path = "./models/Llama-2-7b-chat-hf",
backend_type = "transformers",
load_in_8bit = True
)
Check API Document for more usages.
llama2-wrapper
offers a web server that acts as a drop-in replacement for the OpenAI API. This allows you to use Llama2 models with any OpenAI compatible clients, libraries or services, etc.
Start Fast API:
python -m llama2_wrapper.server
it will use llama.cpp
as the backend by default to run llama-2-7b-chat.ggmlv3.q4_0.bin
model.
Start Fast API for gptq
backend:
python -m llama2_wrapper.server --backend_type gptq
Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
Flag | Description |
---|---|
-h , --help |
Show this help message. |
--model_path |
The path to the model to use for generating completions. |
--backend_type |
Backend for llama2, options: llama.cpp, gptq, transformers |
--max_tokens |
Maximum context size. |
--load_in_8bit |
Whether to use bitsandbytes to run model in 8 bit mode (only for transformers models). |
--verbose |
Whether to print verbose output to stderr. |
--host |
API address |
--port |
API port |
Run benchmark script to compute performance on your device, benchmark.py
will load the same .env
as app.py
.:
python benchmark.py
You can also select the iter
, backend_type
and model_path
the benchmark will be run (overwrite .env args) :
python benchmark.py --iter NB_OF_ITERATIONS --backend_type gptq
By default, the number of iterations is 5, but if you want a faster result or a more accurate one you can set it to whatever value you want, but please only report results with at least 5 iterations.
This colab example also show you how to benchmark gptq model on free Google Colab T4 GPU.
Some benchmark performance:
Model | Precision | Device | RAM / GPU VRAM | Speed (tokens/sec) | load time (s) |
---|---|---|---|---|---|
Llama-2-7b-chat-hf | 8 bit | NVIDIA RTX 2080 Ti | 7.7 GB VRAM | 3.76 | 641.36 |
Llama-2-7b-Chat-GPTQ | 4 bit | NVIDIA RTX 2080 Ti | 5.8 GB VRAM | 18.85 | 192.91 |
Llama-2-7b-Chat-GPTQ | 4 bit | Google Colab T4 | 5.8 GB VRAM | 18.19 | 37.44 |
llama-2-7b-chat.ggmlv3.q4_0 | 4 bit | Apple M1 Pro CPU | 5.4 GB RAM | 17.90 | 0.18 |
llama-2-7b-chat.ggmlv3.q4_0 | 4 bit | Apple M2 CPU | 5.4 GB RAM | 13.70 | 0.13 |
llama-2-7b-chat.ggmlv3.q4_0 | 4 bit | Apple M2 Metal | 5.4 GB RAM | 12.60 | 0.10 |
llama-2-7b-chat.ggmlv3.q2_K | 2 bit | Intel i7-8700 | 4.5 GB RAM | 7.88 | 31.90 |
Check/contribute the performance of your device in the full performance doc.
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
Llama-2-7b-Chat-GPTQ is the GPTQ model files for Meta's Llama 2 7b Chat. GPTQ 4-bit Llama-2 model require less GPU VRAM to run it.
Model Name | set MODEL_PATH in .env | Download URL |
---|---|---|
meta-llama/Llama-2-7b-chat-hf | /path-to/Llama-2-7b-chat-hf | Link |
meta-llama/Llama-2-13b-chat-hf | /path-to/Llama-2-13b-chat-hf | Link |
meta-llama/Llama-2-70b-chat-hf | /path-to/Llama-2-70b-chat-hf | Link |
meta-llama/Llama-2-7b-hf | /path-to/Llama-2-7b-hf | Link |
meta-llama/Llama-2-13b-hf | /path-to/Llama-2-13b-hf | Link |
meta-llama/Llama-2-70b-hf | /path-to/Llama-2-70b-hf | Link |
TheBloke/Llama-2-7b-Chat-GPTQ | /path-to/Llama-2-7b-Chat-GPTQ | Link |
TheBloke/Llama-2-7b-Chat-GGUF | /path-to/llama-2-7b-chat.Q4_0.gguf | Link |
TheBloke/Llama-2-7B-Chat-GGML | /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin | Link |
TheBloke/CodeLlama-7B-Instruct-GPTQ | TheBloke/CodeLlama-7B-Instruct-GPTQ | Link |
... | ... | ... |
Running 4-bit model Llama-2-7b-Chat-GPTQ
needs GPU with 6GB VRAM.
Running 4-bit model llama-2-7b-chat.ggmlv3.q4_0.bin
needs CPU with 6GB RAM. There is also a list of other 2, 3, 4, 5, 6, 8-bit GGML models that can be used from TheBloke/Llama-2-7B-Chat-GGML.
These models can be downloaded through:
python -m llama2_wrapper.download --repo_id TheBloke/CodeLlama-7B-Python-GPTQ
python -m llama2_wrapper.download --repo_id TheBloke/Llama-2-7b-Chat-GGUF --filename llama-2-7b-chat.Q4_0.gguf --save_dir ./models
Or use CMD like:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf
To download Llama 2 models, you need to request access from https://ai.meta.com/llama/ and also enable access on repos like meta-llama/Llama-2-7b-chat-hf. Requests will be processed in hours.
For GPTQ models like TheBloke/Llama-2-7b-Chat-GPTQ, you can directly download without requesting access.
For GGML models like TheBloke/Llama-2-7B-Chat-GGML, you can directly download without requesting access.
There are some examples in ./env_examples/
folder.
Model Setup | Example .env |
---|---|
Llama-2-7b-chat-hf 8-bit (transformers backend) | .env.7b_8bit_example |
Llama-2-7b-Chat-GPTQ 4-bit (gptq transformers backend) | .env.7b_gptq_example |
Llama-2-7B-Chat-GGML 4bit (llama.cpp backend) | .env.7b_ggmlv3_q4_0_example |
Llama-2-13b-chat-hf (transformers backend) | .env.13b_example |
... | ... |
The running requires around 14GB of GPU VRAM for Llama-2-7b and 28GB of GPU VRAM for Llama-2-13b.
If you are running on multiple GPUs, the model will be loaded automatically on GPUs and split the VRAM usage. That allows you to run Llama-2-7b (requires 14GB of GPU VRAM) on a setup like 2 GPUs (11GB VRAM each).
If you do not have enough memory, you can set up your LOAD_IN_8BIT
as True
in .env
. This can reduce memory usage by around half with slightly degraded model quality. It is compatible with the CPU, GPU, and Metal backend.
Llama-2-7b with 8-bit compression can run on a single GPU with 8 GB of VRAM, like an Nvidia RTX 2080Ti, RTX 4080, T4, V100 (16GB).
If you want to run 4 bit Llama-2 model like Llama-2-7b-Chat-GPTQ
, you can set up your BACKEND_TYPE
as gptq
in .env
like example .env.7b_gptq_example
.
Make sure you have downloaded the 4-bit model from Llama-2-7b-Chat-GPTQ
and set the MODEL_PATH
and arguments in .env
file.
Llama-2-7b-Chat-GPTQ
can run on a single GPU with 6 GB of VRAM.
If you encounter issue like NameError: name 'autogptq_cuda_256' is not defined
, please refer to here
If you have multiple GPUs, you need to set the maximum usage of memory for each GPU through the parameter --gptq_gpu_memory
. Otherwise, memory will only be allocated on the first GPU. If first GPU's memory is not enough, this will cause an error: torch.cuda.OutOfMemoryError : CUDA out of memory.
. An example running on 24GB memory dual GPUs: --gptq_gpu_memory "0:23GiB,1:23GiB"
.
Run Llama-2 model on CPU requires llama.cpp dependency and llama.cpp Python Bindings, which are already installed.
Download GGML models like llama-2-7b-chat.ggmlv3.q4_0.bin
following Download Llama-2 Models section. llama-2-7b-chat.ggmlv3.q4_0.bin
model requires at least 6 GB RAM to run on CPU.
Set up configs like .env.7b_ggmlv3_q4_0_example
from env_examples
as .env
.
Run web UI python app.py
.
For Mac users, you can also set up Mac Metal for acceleration, try install this dependencies:
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'
or check details:
If you would like to use AMD/Nvidia GPU for acceleration, check this:
MIT - see MIT License
This project enables users to adapt it freely for proprietary purposes without any restrictions.
Kindly read our Contributing Guide to learn and understand our development process.
- https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
- https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat
- https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ
- https://github.com/ggerganov/llama.cpp
- https://github.com/TimDettmers/bitsandbytes
- https://github.com/PanQiWei/AutoGPTQ
- https://github.com/abetlen/llama-cpp-python