triton_transformers

Deploy optimized transformer-based models on the Nvidia Triton server


🤗 Hugging Face Transformer submillisecond inference 🤯 and deployment on Nvidia Triton server

Yes, you can perform inference with a transformer-based model in less than 1ms on the cheapest GPU available on Amazon (the T4)!

The commands below have been tested on an AWS g4dn instance with the Deep Learning Base AMI (Ubuntu 18.04) Version 44.0. They may require some small adaptations to run on another Linux distribution.

You can find explanations of how it works in the article Hugging Face Transformer inference UNDER 1 millisecond latency

Baseline set by Hugging Face Infinity demo

Hugging Face infinity demo video

  • AWS virtual machine: g4dn.xlarge (T4 GPU)
  • model: "philschmid/MiniLM-L6-H384-uncased-sst2" (Hugging Face hub model)
  • experiment 1: batch size 1, seq len 16 tokens -> 1.7ms
  • experiment 2: batch size 1, seq len 128 tokens -> 2.5ms
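
For reference, a raw PyTorch measure on the same model (comparable to the Pytorch_32 line in the timings further below) can be sketched as follows. It is a minimal illustration assuming a CUDA GPU and the transformers + torch packages, not the Infinity benchmark code:

# hedged sketch of a raw PyTorch latency measure (not the Infinity benchmark code)
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "philschmid/MiniLM-L6-H384-uncased-sst2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().cuda()
tokens = tokenizer("This live event is great. I will sign-up for Infinity.", return_tensors="pt").to("cuda")

timings = []
with torch.inference_mode():
    for _ in range(10):  # warmup
        model(**tokens)
    torch.cuda.synchronize()
    for _ in range(100):
        start = time.perf_counter()
        model(**tokens)
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)
print(f"mean latency: {sum(timings) / len(timings) * 1000:.2f} ms")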

Install dependencies

These dependencies have to be installed directly on the remote machine (not inside a container).

git clone git@github.com:ELS-RD/triton_transformers.git
pip3 install -r triton_transformers/requirements.txt

Generate optimized models

We generate the models inside a Docker image so we can also get measurements for TensorRT + ONNX Runtime.

cd triton_transformers
DOCKER_BUILDKIT=1 docker build --tag onnxruntime-trt:latest -f Dockerfile .
docker run -it --rm --gpus all -v $PWD:/project onnxruntime-trt bash -c "cd /project && python convert_onnx.py"

⚠️WARNING⚠️: if you run the conversion outside of the Docker container, you may get very different timings, and TensorRT won't work

It should produce something like this:

10/31/2021 11:35:08 INFO     inference done on Tesla T4
10/31/2021 11:35:08 INFO     timing [[TensorrtExecutionProvider] ./onnx_models/model-shape.onnx]: mean=0.61ms, sd=0.11ms, min=0.52ms, max=0.92ms, median=0.54ms, 95p=0.88ms, 99p=0.90ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model.onnx]: mean=1.10ms, sd=0.10ms, min=1.04ms, max=3.44ms, median=1.07ms, 95p=1.29ms, 99p=1.36ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model-optimized.onnx]: mean=0.63ms, sd=0.05ms, min=0.60ms, max=0.84ms, median=0.61ms, 95p=0.77ms, 99p=0.79ms
10/31/2021 11:35:08 INFO     timing [Pytorch_32]: mean=5.09ms, sd=0.16ms, min=4.88ms, max=6.11ms, median=5.07ms, 95p=5.28ms, 99p=5.35ms
10/31/2021 11:35:08 INFO     timing [Pytorch_FP16]: mean=6.04ms, sd=0.74ms, min=5.77ms, max=28.79ms, median=6.05ms, 95p=6.19ms, 99p=6.29ms

TensorRT and optimized ONNX Runtime provide very similar results on short sequences. In the following steps, we will continue with the ONNX Runtime model because dynamic axes are easier to work with than in TensorRT.

The Docker build is very slow on a g4dn instance, be patient... The Docker image is only required for TensorRT support inside ONNX Runtime (and to measure the difference, if any, with plain ONNX Runtime).
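
For the curious, the timings above are produced by convert_onnx.py; a minimal hedged sketch of how a single ONNX Runtime provider can be timed is shown below (the helper function is illustrative, input names follow the exported model):

# hedged sketch: time an ONNX model with a given ONNX Runtime execution provider
# (convert_onnx.py does the real work, this helper is only illustrative)
import time
import numpy as np
import onnxruntime as ort

def benchmark(model_path: str, provider: str, nb_runs: int = 100) -> float:
    session = ort.InferenceSession(model_path, providers=[provider])
    inputs = {
        "input_ids": np.ones((1, 16), dtype=np.int64),
        "attention_mask": np.ones((1, 16), dtype=np.int64),
    }
    session.run(None, inputs)  # warmup (also triggers the TensorRT engine build on the first call)
    start = time.perf_counter()
    for _ in range(nb_runs):
        session.run(None, inputs)
    return (time.perf_counter() - start) / nb_runs * 1000

for provider, path in [("TensorrtExecutionProvider", "./onnx_models/model-shape.onnx"),
                       ("CUDAExecutionProvider", "./onnx_models/model-optimized.onnx")]:
    print(provider, f"{benchmark(path, provider):.2f} ms")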

FastAPI server

This is our baseline, easy to run, but not very performant.
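
Under the hood, server_onnx.py is basically a FastAPI endpoint wrapping the tokenizer and an ONNX Runtime session. The snippet below is a simplified approximation (input names and response format are assumptions), not the exact repo code:

# simplified sketch of an ONNX Runtime FastAPI endpoint (not the exact server_onnx.py)
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")
session = ort.InferenceSession("./onnx_models/model-optimized.onnx",
                               providers=["CUDAExecutionProvider"])

@app.get("/predict")
def predict(query: str):
    tokens = tokenizer(query, return_tensors="np")
    inputs = {
        "input_ids": tokens["input_ids"].astype(np.int64),
        "attention_mask": tokens["attention_mask"].astype(np.int64),
    }
    logits = session.run(None, inputs)[0]
    return {"logits": logits.tolist()}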

# launch the server, disable logging for best performance
python3 -m uvicorn --log-level warning server_onnx:app --port 8000 --host 0.0.0.0
# other variation with gunicorn, a single worker for best latency (it's not a good idea to load the same model several times on a single GPU anyway):
python3 -m gunicorn -w 1 -k uvicorn.workers.UvicornWorker --log-level warning server_onnx:app --bind 0.0.0.0:8000

# simple inference timing
time curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict
# slightly more serious measure
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo perf stat -r 50 -d curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict -s > /dev/null

It should produce:

Performance counter stats for 'curl -G --data-urlencode query=This live event is great. I will sign-up for Infinity. localhost:8000/predict' (50 runs):

              6.14 msec task-clock                #    0.494 CPUs utilized            ( +-  0.59% )
                 3      context-switches          #    0.462 K/sec                    ( +-  1.84% )
                 0      cpu-migrations            #    0.000 K/sec                  
               577      page-faults               #    0.094 M/sec                    ( +-  0.06% )
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

         0.0124429 +- 0.0000547 seconds time elapsed  ( +-  0.44% )

Triton server

We copy the ONNX model generated in the first step into the Triton model folder, then we launch the Triton image. As you can see, we install Transformers and then launch the server itself. This is of course bad practice; you should build your own two-line Dockerfile with Transformers inside.

# copy the generated model to triton model folder
cp ./onnx_models/model-optimized.onnx ./triton_models/sts/1/model.onnx
# install transformers (and its tokenizer) and launch the server in a single line, ugly but good enough for our demo
# --shm-size 256m -> to allow several Python backend instances at the same time
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"
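
The Transformers dependency is needed because the tokenization step runs inside Triton as a Python backend model. The sketch below is a simplified approximation of such a model.py (the actual file shipped in triton_models differs in its details; input/output names follow the ones used elsewhere in this README):

# simplified sketch of a Triton Python backend doing the tokenization
# (the real triton_models tokenizer model differs in details)
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        self.tokenizer = AutoTokenizer.from_pretrained("philschmid/MiniLM-L6-H384-uncased-sst2")

    def execute(self, requests):
        responses = []
        for request in requests:
            # TEXT is a tensor of UTF-8 strings sent by the client
            query = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            texts = [t.decode("utf-8") for t in query.flatten()]
            tokens = self.tokenizer(texts, return_tensors="np")
            # int64 for the ONNX Runtime model, see the TensorRT section below for the int32 variant
            input_ids = pb_utils.Tensor("INPUT_IDS", tokens["input_ids"].astype(np.int64))
            attention = pb_utils.Tensor("ATTENTION", tokens["attention_mask"].astype(np.int64))
            responses.append(pb_utils.InferenceResponse(output_tensors=[input_ids, attention]))
        return responses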

Triton server perf analysis

You need to edit the source code of triton_transformers.py to select the 16 or 128 token sequence (the texts are already included).

  • 16 tokens:
ubuntu@ip-172-31-31-84:~/triton_transformers$ python3 triton_transformers.py 
10/31/2021 12:09:34 INFO     timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
[[-3.4355469  3.2753906]]
  • 128 tokens:
ubuntu@ip-XXX:~/triton_transformers$ python3 triton_transformers.py 
10/31/2021 12:12:00 INFO     timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
[[-3.4589844  3.3027344]]
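
triton_transformers.py is essentially a tritonclient loop around the transformers ensemble model. A single call can be sketched as below; the output tensor name is an assumption and must match your config.pbtxt, and the input shape may need a leading batch dimension depending on that configuration:

# hedged sketch of a single call to the "transformers" model through tritonclient
# (triton_transformers.py adds the timing loop and the 16/128 token texts)
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")
text = "This live event is great. I will sign-up for Infinity."

query = httpclient.InferInput("TEXT", [1], "BYTES")
query.set_data_from_numpy(np.asarray([text], dtype=object))
# "output" is an assumed name, it must match the model configuration
model_output = httpclient.InferRequestedOutput("output", binary_data=False)

response = client.infer(model_name="transformers", inputs=[query], outputs=[model_output])
print(response.as_numpy("output"))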

There is also a more serious performance analysis tool called perf_analyzer (it takes care of checking that measures are stable, etc.), see its documentation. The tool needs to be run on Ubuntu >= 20.04 (it won't work on the Ubuntu 18.04 used by the official AWS Ubuntu deep learning image). It can also take measures on TorchServe and TensorFlow Serving.

# perf_analyzer needs this dependency
sudo apt install libb64-dev
# add -a for async measures, and -i grpc to use that protocol instead of http 
~/.local/bin/perf_analyzer -m transformers --percentile=95 --input-data perf_data.json --shape TEXT:1 # -i grpc -a
# just test the model part (easier to get random input)
~/.local/bin/perf_analyzer --input-data zero -m sts --shape input_ids:1,16 --shape attention_mask:1,16 #-i grpc -a

Call Triton HTTP API directly

If you don't want to use the tritonclient API, you can call the Triton server in the following ways:

# if you like Python requests library
python3 triton_requests.py

# if you want a generic HTTP call, the @ tells curl to send the file content as-is (no data conversion)
curl -X POST  http://localhost:8000/v2/models/transformers/versions/1/infer \
  --data-binary "@query_body.bin" \
  --header "Inference-Header-Content-Length: 160"
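
triton_requests.py uses the plain JSON flavour of the same v2 HTTP API through the requests library. It does roughly what the sketch below shows (no binary encoding; tensor name and shape are assumed to match the model configuration):

# hedged sketch of a plain JSON call to the Triton v2 HTTP API (no binary encoding),
# roughly what triton_requests.py does with the requests library
import requests

payload = {
    "inputs": [
        {
            "name": "TEXT",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This live event is great. I will sign-up for Infinity."],
        }
    ]
}
response = requests.post(
    "http://localhost:8000/v2/models/transformers/versions/1/infer", json=payload
)
print(response.json())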

Use TensorRT model in Triton server (instead of ONNX)

To use a TensorRT model instead of the ONNX Runtime one:

  • we need to convert the ONNX model to a TensorRT engine
  • we need to update the configuration, as TensorRT takes int32 inputs instead of int64
# we use a Docker container to guarantee the use of the right trtexec version (otherwise you will get a deserialization error)
# it's a basic conversion; in real life you would want to tune at least the minimum, optimum and maximum shapes
# it may take a few minutes...
docker run -it --rm --gpus all -v $PWD/onnx_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
    /usr/src/tensorrt/bin/trtexec \
    --onnx=/models/model.onnx \
    --best \
    --minShapes=input_ids:1x16,attention_mask:1x16 \
    --optShapes=input_ids:1x16,attention_mask:1x16 \
    --maxShapes=input_ids:32x16,attention_mask:32x16  \
    --saveEngine="/models/model.plan" \
    --workspace=6000 \
    --useCudaGraph

# move to triton model folder
cp ./onnx_models/model.plan ./triton_models/sts/1/model.plan

You then need to update your config.pbtxt files in the sts and tokenizer folders: replace all TYPE_INT64 tensor types by TYPE_INT32. In the sts configuration file, also replace platform: "onnxruntime_onnx" by platform: "tensorrt_plan". Finally, convert the numpy tensors to int32 in the tokenizer Python code, like below (notice the astype()):

input_ids = pb_utils.Tensor("INPUT_IDS", tokens['input_ids'].astype(np.int32))
attention = pb_utils.Tensor("ATTENTION", tokens['attention_mask'].astype(np.int32))

And you are done!