You need docker, docker-compose, and nvidia-docker installed.
You need a modern Nvidia GPU.
Open docker-compose.yml and modify the command section to reflect what model you want to run.
There are several commented examples on how to run the model with various methods.
There are many flags, but some of the more important ones are used in the command.
--model-id determines what model we are running. It supports either a model on the Hugging Face Hub or a local model.
--huggingface-hub-cache is used along with a docker-compose volume to prevent the need to redownload models from the hub each time we want to use them.
--trust-remote-code is required for some models to work. Just be sure that you can trust the model that you are running.
--quantize is an optional flag that will reduce the memory requirements of the model and/or speed the model up. The arguments that it supports are either bitsandbytes or gptq. The former should work with almost all models, while the latter needs the model to be converted before it will work.
In my experience, GPTQ is faster, but bitsandbytes is easier and should have slightly better results due to the way it is done.
You will also want to change the section about GPU ids. Make sure that you are passing all the GPUs that you want as a list to the container.
For any custom models that you want to run, put them inside the models folder.
You will then give a value such as /models/custom_model for the --model-id flag.
After you have configured everything, you can run the API simply by running docker-compose up -d
To use the API, reading the docs will be useful. Most of the Huggingface Generate methods are supported.
For quick usage, you can look at query_api.py and query_api_streaming.py. Both take a txt file as input with the -f flag.
The latter program streams the token outputs as ChatGPT does.
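As a rough illustration of what a non-streaming query can look like, here is a minimal sketch using only the Python standard library. It is not the content of query_api.py. The base URL http://127.0.0.1:8080 is an assumption (use whatever host and port your docker-compose.yml publishes), and the helper names build_generate_request and generate are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20,
                           base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the /generate route.

    base_url is an assumption; adjust it to the port your container exposes.
    """
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text.

    Requires a running server; the response is expected to be a JSON
    object with a "generated_text" field.
    """
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]
```

With the server up, calling generate("What is Deep Learning?", max_new_tokens=50) should return the completion as a string. Building the request separately from sending it makes the payload easy to inspect without a running server.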
A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoint.
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:
- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization with :
- Safetensors weight loading
- Watermarking with A Watermark for Large Language Models
- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see transformers.LogitsProcessor)
- Stop sequences
- Log probabilities
- Speculation ~2x latency
- Guidance/JSON. Specify output format to speed up inference and make sure the output is valid according to some specs.
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
- Nvidia
- AMD (-rocm)
- Inferentia
- Intel GPU
- Gaudi
For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
And then you can make requests like
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
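The /generate_stream route answers with Server-Sent Events, one event per generated token. The sketch below shows one way to pull the token text out of such a stream in Python. The event shape used here (a data: line carrying a JSON object with a token field that has text and special keys) is an assumption based on typical TGI responses, so check it against the /docs route of your server. The function names are hypothetical.

```python
import json


def token_text_from_sse_line(line):
    """Return the token text carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, or other SSE fields
    event = json.loads(line[len("data:"):])
    token = event.get("token") or {}
    if token.get("special"):
        return None  # skip special tokens such as end-of-sequence markers
    return token.get("text")


def join_stream(lines):
    """Concatenate the token texts from an iterable of SSE lines."""
    pieces = (token_text_from_sse_line(line) for line in lines)
    return "".join(piece for piece in pieces if piece is not None)
```

In practice the lines would come from iterating over the HTTP response body line by line, printing each piece as it arrives instead of joining them at the end.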
Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the --gpus all flag and add --disable-custom-kernels. Please note that CPU is not the intended platform for this project, so performance might be subpar.
Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, use the following command instead of the one above:
docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0-rocm --model-id $model
To see all options to serve your models (in the code or in the cli):
text-generation-launcher --help
You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route.
The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.
You can use the HUGGING_FACE_HUB_TOKEN environment variable to configure the token used by text-generation-inference. This allows you to access protected resources.
For example, if you want to serve the gated Llama V2 model variants:
- Go to https://huggingface.co/settings/tokens
- Copy your cli READ token
- Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>
or with Docker:
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
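If you launch the container from a script, the same command can be assembled programmatically so that the token is only passed when one is set. This is a small sketch, not part of the project; docker_run_command is a hypothetical helper, and the image tag and port mapping are copied from the command above.

```python
import shlex


def docker_run_command(model, volume, token=None,
                       image="ghcr.io/huggingface/text-generation-inference:2.0"):
    """Return the docker run argument list for serving `model`.

    The token is optional: it is only needed for gated or private models.
    """
    args = ["docker", "run", "--gpus", "all", "--shm-size", "1g"]
    if token:
        # Passed to the container as an environment variable.
        args += ["-e", f"HUGGING_FACE_HUB_TOKEN={token}"]
    args += ["-p", "8080:80", "-v", f"{volume}:/data", image, "--model-id", model]
    return args


def as_shell_string(args):
    """Render the argument list as a copy-pastable shell command."""
    return shlex.join(args)
```

The list form can be handed directly to subprocess.run, which avoids shell quoting problems when the volume path contains spaces.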
NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.
In order to share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.
To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the above command.
If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
and mounting it to /dev/shm.
Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.
text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.
You can also opt to install text-generation-inference locally.
First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
You may also need to install Protoc.
On Linux:
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
On macOS, using Homebrew:
brew install protobuf
Then run:
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:
sudo apt-get install libssl-dev gcc -y
TGI works out of the box to serve optimized versions of all modern models. They can be found in this list.
Other architectures are supported on a best-effort basis using:
AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")
or
AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.
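To get a feel for how much VRAM quantization can save, the back-of-the-envelope estimate below computes the memory needed for the weights alone at different precisions. It is a rough sketch: it assumes a hypothetical model with exactly 7 billion parameters and ignores activations, the KV cache, and the small overhead of quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


# Assumed parameter count for illustration only.
NUM_PARAMETERS = 7e9

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit (NF4/FP4)", 4)]:
    print(f"{label:>16}: {weight_memory_gb(NUM_PARAMETERS, bits):.1f} GB")
```

Under these assumptions the weights take roughly 14 GB in float16, 7 GB at 8 bits, and 3.5 GB at 4 bits.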
make server-dev
make router-dev
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests