Architecture | Results | Examples | Documentation
- TensorRT-LLM Overview
- Installation
- Quick Start
- Support Matrix
- Performance
- Advanced Topics
- Troubleshooting
- Release Notes
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, ranging from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
The Python API of TensorRT-LLM is architected to look similar to the PyTorch API. It provides users with a functional module containing functions like einsum, softmax, matmul or view. The layers module bundles useful building blocks to assemble LLMs, like an Attention block, an MLP or an entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the models module.
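The functional-vs-layers split described above can be illustrated in plain Python (this is a conceptual sketch of the design pattern, not TensorRT-LLM code): free functions provide primitive operations, and layer classes compose them into reusable building blocks.

```python
import math

def matmul(a, b):
    """Naive matrix multiply over lists of lists (a "functional"-style primitive)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a single row (another primitive)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

class MLP:
    """A "layers"-style building block assembled from the functional primitives."""
    def __init__(self, w1, w2):
        self.w1, self.w2 = w1, w2

    def __call__(self, x):
        hidden = matmul(x, self.w1)
        # ReLU activation applied element-wise between the two projections.
        hidden = [[max(0.0, v) for v in row] for row in hidden]
        return matmul(hidden, self.w2)

# Tiny hypothetical weights, purely for illustration.
mlp = MLP(w1=[[1.0, -1.0], [0.5, 2.0]], w2=[[1.0], [1.0]])
out = mlp([[2.0, 1.0]])
```

In TensorRT-LLM the same pattern applies, except that the functions build nodes in a TensorRT network definition rather than computing values eagerly.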
TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. See below for a list of supported models.
To maximize performance and reduce memory footprint, TensorRT-LLM allows the models to be executed using different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
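The weight-only idea can be sketched in a few lines of plain Python (a minimal illustration of symmetric per-column INT8 quantization, not the library's implementation): weights are stored as int8 plus a scale factor, and dequantized back to floating point when used, which shrinks the weight memory roughly 2x versus FP16.

```python
def quantize_int8(column):
    """Symmetric quantization: map floats into the int8 range [-127, 127]."""
    scale = max(abs(v) for v in column) / 127.0 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(v / scale))) for v in column]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights from int8 values."""
    return [v * scale for v in q]

# Hypothetical weight column, purely for illustration.
weights = [0.6, -1.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

In a real W8A16 kernel the dequantization happens on the fly inside the GEMM, so activations stay in FP16 while weights are read as INT8.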
For a more detailed presentation of the software architecture and the key concepts used in TensorRT-LLM, we recommend reading the following document.
For Windows installation, see Windows/.
TensorRT-LLM must be built from source; instructions can be found here. An image of a Docker container with TensorRT-LLM and its Triton Inference Server backend will be made available soon.
The remaining commands in that document must be executed from the TensorRT-LLM container.
To create a TensorRT engine for an existing model, there are three steps:
- Download pre-trained weights,
- Build a fully-optimized engine of the model,
- Deploy the engine.
The following sections show how to use TensorRT-LLM to run the BLOOM-560m model.
0. In the BLOOM folder
Inside the Docker container, install the requirements:
pip install -r examples/bloom/requirements.txt
git lfs install
1. Download the model weights from HuggingFace
From the BLOOM example folder, download the model weights:
cd examples/bloom
rm -rf ./bloom/560M
mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M
2. Build the engine
# Single GPU on BLOOM 560M
python build.py --model_dir ./bloom/560M/ \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
See the BLOOM example for more details and options regarding the build.py script.
3. Run
The summarize.py script can be used to summarize articles from the CNN Daily dataset:
python summarize.py --test_trt_llm \
--hf_model_location ./bloom/560M/ \
--data_type fp16 \
--engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
More details about the script and how to run the BLOOM model can be found in the example folder. Many more models than BLOOM are implemented in TensorRT-LLM. They can be found in the examples directory.
TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide a list of supported GPU architectures as well as important features implemented in TensorRT-LLM.
TensorRT-LLM is rigorously tested on the following GPUs:
Even if a GPU is not listed above, TensorRT-LLM is expected to work on GPUs based on the Volta, Turing, Ampere, Hopper and Ada Lovelace architectures, though certain limitations may apply.
Various numerical precisions are supported in TensorRT-LLM. Support for some of these numerical features requires specific architectures:
| Architecture | FP32 | FP16 | BF16 | FP8 | INT8 | INT4 |
|---|---|---|---|---|---|---|
| Volta (SM70) | Y | Y | N | N | Y | Y |
| Turing (SM75) | Y | Y | N | N | Y | Y |
| Ampere (SM80, SM86) | Y | Y | Y | N | Y | Y |
| Ada Lovelace (SM89) | Y | Y | Y | Y | Y | Y |
| Hopper (SM90) | Y | Y | Y | Y | Y | Y |
In this release of TensorRT-LLM, the support for FP8 and quantized data types (INT8 or INT4) is not implemented for all the models. See the precision document and the examples folder for additional details.
TensorRT-LLM contains examples that implement the following features.
- Multi-head Attention (MHA)
- Multi-query Attention (MQA)
- Group-query Attention (GQA)
- In-flight Batching
- Paged KV Cache for the Attention
- Tensor Parallelism
- Pipeline Parallelism
- INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
- SmoothQuant
- GPTQ
- AWQ
- FP8
- Greedy-search
- Beam-search
- RoPE
In this release of TensorRT-LLM, some of the features are not enabled for all the models listed in the examples folder.
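Two of the decoding features above, greedy search and beam search, can be contrasted with a toy plain-Python example (the model and its probabilities are hypothetical, purely for illustration): greedy keeps only the single best token at each step, while beam search keeps the top-k partial sequences and can recover a higher-probability sequence overall.

```python
import math

def step_logprobs(prefix):
    """Hypothetical model: log-probabilities for the next token given a prefix."""
    if not prefix:
        return {"a": math.log(0.5), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"x": math.log(0.31), "y": math.log(0.29)}
    return {"x": math.log(0.9), "y": math.log(0.1)}

def greedy(model, n):
    """Pick the single most likely token at each step."""
    seq = []
    for _ in range(n):
        probs = model(seq)
        seq.append(max(probs, key=probs.get))
    return seq

def beam_search(model, n, width=2):
    """Keep the `width` best partial sequences (by total log-probability)."""
    beams = [([], 0.0)]
    for _ in range(n):
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in model(seq).items()]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0][0]

g = greedy(step_logprobs, 2)
b = beam_search(step_logprobs, 2)
```

Here greedy commits to "a" (probability 0.5) and ends with total probability 0.155, while beam search finds the "b"-prefixed sequence with total probability 0.36.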
The list of supported models is:
- Baichuan
- Bert
- Blip2
- BLOOM
- ChatGLM-6B
- ChatGLM2-6B
- Falcon
- GPT
- GPT-J
- GPT-Nemo
- GPT-NeoX
- LLaMA
- LLaMA-v2
- MPT
- OPT
- SantaCoder
- StarCoder
Please refer to the performance page for performance numbers. That page contains measured numbers for four variants of popular models (GPT-J, LLaMA-7B, LLaMA-70B, Falcon-180B) measured on H100, L40S, and A100 GPUs.
This document describes the different quantization methods implemented in TensorRT-LLM and contains a support matrix for the different models.
TensorRT-LLM supports in-flight batching of requests (also known as continuous batching or iteration-level batching). It's a technique that aims at reducing wait times in queues, eliminating the need for padding requests and allowing for higher GPU utilization.
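The scheduling idea can be modeled in a few lines of plain Python (a simplified sketch, not TensorRT-LLM's actual scheduler): at each generation step, finished requests leave the batch immediately and queued requests take their slots, instead of the whole batch draining before new requests are admitted.

```python
from collections import deque

def run_in_flight(output_lens, batch_size):
    """Simulate iteration-level batching; return each request's completion step."""
    queue = deque(enumerate(output_lens))
    active = {}   # request index -> tokens still to generate
    done_at = {}  # request index -> step at which it finished
    step = 0
    while queue or active:
        # Iteration-level admission: fill free slots from the queue each step.
        while queue and len(active) < batch_size:
            idx, n = queue.popleft()
            active[idx] = n
        step += 1
        for idx in list(active):
            active[idx] -= 1           # one token per active request per step
            if active[idx] == 0:
                done_at[idx] = step
                del active[idx]        # slot is freed for the next iteration
    return done_at

# Three hypothetical requests with very different output lengths, two slots.
done = run_in_flight([2, 8, 2], batch_size=2)
```

In this toy run, the third request starts as soon as the first finishes (step 2) and completes at step 4; with static batching it would have waited for the 8-token request to drain the batch first.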
TensorRT-LLM implements several variants of the Attention mechanism that appear in most Large Language Models. This document summarizes those implementations and how they are optimized in TensorRT-LLM.
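A back-of-the-envelope sketch (plain Python, not TensorRT-LLM code) shows the practical difference between the variants: MHA, GQA, and MQA change how many key/value heads are stored, which directly scales the size of the KV cache. The model dimensions below are hypothetical, chosen only to make the arithmetic concrete.

```python
def kv_cache_bytes(seq_len, num_kv_heads, head_dim, num_layers, bytes_per_elem=2):
    """Per-sequence KV-cache size; the leading 2 accounts for keys AND values."""
    return 2 * seq_len * num_kv_heads * head_dim * num_layers * bytes_per_elem

# Hypothetical model: 32 query heads, head_dim 128, 32 layers, FP16 cache,
# 4096-token sequence.
mha = kv_cache_bytes(4096, num_kv_heads=32, head_dim=128, num_layers=32)  # one KV head per query head
gqa = kv_cache_bytes(4096, num_kv_heads=8,  head_dim=128, num_layers=32)  # query heads share 8 KV heads
mqa = kv_cache_bytes(4096, num_kv_heads=1,  head_dim=128, num_layers=32)  # a single shared KV head
```

For these numbers, MHA needs 2 GiB of KV cache per sequence, GQA with 8 KV heads needs a quarter of that, and MQA a thirty-second, which is why the KV-head count matters so much for batch size and paged KV-cache capacity.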
TensorRT-LLM uses a declarative approach to define neural networks and contains techniques to optimize the underlying graph. For more details, please refer to the documentation.
TensorRT-LLM provides C++ and Python tools to perform benchmarking. Note, however, that it is recommended to use the C++ version.
- It's recommended to add the options --shm-size=1g --ulimit memlock=-1 to the docker or nvidia-docker run command. Otherwise, you may see NCCL errors when running multi-GPU inference. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#errors for details.
- When building models, memory-related issues such as
[09/23/2023-03:13:00] [TRT] [E] 9: GPTLMHeadModel/layers/0/attention/qkv/PLUGIN_V2_Gemm_0: could not find any supported formats consistent with input/output data types
[09/23/2023-03:13:00] [TRT] [E] 9: [pluginV2Builder.cpp::reportPluginError::24] Error Code 9: Internal Error (GPTLMHeadModel/layers/0/attention/qkv/PLUGIN_V2_Gemm_0: could not find any supported formats consistent with input/output data types)
may happen. One possible solution is to reduce the amount of memory needed by reducing the maximum batch size and the input and output lengths. Another option is to enable plugins, for example: --use_gpt_attention_plugin.
- TensorRT-LLM requires TensorRT 9.1.0.4 and the 23.08 containers.
- TensorRT-LLM v0.5.0 is the first public release.
You can use GitHub issues to report issues with TensorRT-LLM.