mnist-fastapi-aio-triton

Simple example of FastAPI + gRPC AsyncIO + Triton


You can find the previous works here:

  1. https://github.com/Curt-Park/triton-inference-server-practice (00-quick-start)
  2. https://github.com/Curt-Park/producer-consumer-fastapi-celery
  3. https://github.com/Curt-Park/mnist-fastapi-celery-triton

FastAPI + Triton (AsyncIO & gRPC)

Preparation

1. Setup packages

Install Anaconda and execute the following commands:

$ make env        # create a conda environment (need only once)
$ conda activate mnist-fastapi-aio-triton
$ make setup      # setup packages (need only once)

2. Train a CNN model (GPU recommended)

$ make train
$ tree model_repository  # check the model repository created

model_repository
└── mnist_cnn
    ├── 1
    │   └── model.pt
    └── config.pbtxt

2 directories, 2 files
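
For reference, here is a minimal sketch of how make train might export the trained model as TorchScript for Triton's PyTorch backend. The architecture, paths, and training details below are illustrative assumptions, not the repo's actual code:

# sketch: export a trained MNIST CNN to model_repository/mnist_cnn/1/model.pt
from pathlib import Path

import torch
import torch.nn as nn

class MnistCNN(nn.Module):
    """Toy MNIST CNN; the repo's real architecture may differ."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 26 * 26, 10),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MnistCNN().eval()  # training loop omitted for brevity
# Trace to TorchScript with an MNIST-shaped dummy input and save it
# under model_repository/<model_name>/<version>/model.pt.
traced = torch.jit.trace(model, torch.randn(1, 1, 28, 28))
out_dir = Path("model_repository/mnist_cnn/1")
out_dir.mkdir(parents=True, exist_ok=True)
traced.save(str(out_dir / "model.pt"))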

How to play

Server

$ make triton     # run triton server
$ make api        # run fastapi server
  • NOTE: If you want to run the Triton server and the FastAPI server on different devices, set TRITON_SERVER_URL before running FastAPI:
export TRITON_SERVER_URL=ip-address:8001
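
For reference, here is a minimal sketch of what the FastAPI server does: an async endpoint that forwards an MNIST image to Triton over gRPC AsyncIO. The endpoint path, request schema, and tensor names (INPUT__0/OUTPUT__0, following the PyTorch-backend naming convention) are assumptions, not necessarily the repo's actual code:

# sketch: FastAPI endpoint backed by Triton's gRPC AsyncIO client
import os

import numpy as np
import tritonclient.grpc.aio as grpcclient
from fastapi import FastAPI
from pydantic import BaseModel

TRITON_SERVER_URL = os.getenv("TRITON_SERVER_URL", "localhost:8001")

app = FastAPI()
client = grpcclient.InferenceServerClient(url=TRITON_SERVER_URL)

class PredictRequest(BaseModel):
    image: list[float]  # flattened 28x28 grayscale image

@app.post("/predict")
async def predict(request: PredictRequest) -> dict:
    # Build a 1x1x28x28 FP32 tensor for the mnist_cnn model.
    batch = np.asarray(request.image, dtype=np.float32).reshape(1, 1, 28, 28)
    infer_input = grpcclient.InferInput("INPUT__0", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)
    # Await the inference without blocking the event loop.
    result = await client.infer(
        model_name="mnist_cnn",
        inputs=[infer_input],
        outputs=[grpcclient.InferRequestedOutput("OUTPUT__0")],
    )
    logits = result.as_numpy("OUTPUT__0")
    return {"prediction": int(np.argmax(logits))}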

Execute Locust

$ make locust

Open http://0.0.0.0:8089 and enter the API address in the Host field.
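
For reference, a minimal locustfile sketch that matches the experiment setup (one request per second per v-user); the /predict endpoint and payload mirror the FastAPI sketch above and are assumptions, not the repo's actual file:

# sketch: locustfile for load-testing the FastAPI server
import random

from locust import HttpUser, constant, task

class MnistUser(HttpUser):
    wait_time = constant(1)  # each v-user sends one request per second

    @task
    def predict(self) -> None:
        # Random pixel values stand in for a real 28x28 MNIST image.
        image = [random.random() for _ in range(28 * 28)]
        self.client.post("/predict", json={"image": image})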

Experimental Result

  • CPU: AMD Ryzen Threadripper PRO 3995WX 64-Cores
  • GPU: NVIDIA GeForce RTX 3090
  • FastAPI, Triton, and Locust are executed on the same device
  • each virtual user (v-user) sends one request per second

Response latency starts to increase at around 1,400 v-users.

The following are the Triton metrics. The difference between nv_inference_count and nv_inference_exec_count shows that dynamic batching works: 395,927 inferences over 193,751 model executions means each execution processed about 2.04 requests on average. (Details about Triton Metrics)

# Triton Metrics
$ curl localhost:8002/metrics

# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="mnist_cnn",version="1"} 395927.000000
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="mnist_cnn",version="1"} 0.000000
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="mnist_cnn",version="1"} 395927.000000
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="mnist_cnn",version="1"} 193751.000000
...
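
Dynamic batching is enabled in config.pbtxt. A hedged example of what the model configuration might look like (the tensor names and values are illustrative, not necessarily the repo's actual settings):

# sketch: model_repository/mnist_cnn/config.pbtxt
name: "mnist_cnn"
platform: "pytorch_libtorch"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 1, 28, 28 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
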
  • NOTE: The number of connections is limited by the maximum number of open files (check it with the ulimit -a command). To raise the upper bound, set the maximum number of open files, e.g. ulimit -Sn 65535.
$ ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 513958
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535    # <- This line!
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 513958
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
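
Note that ulimit -Sn only changes the limit for the current shell session. On Linux, one common way to make it persistent (assuming a PAM-based system) is /etc/security/limits.conf:

# sketch: /etc/security/limits.conf entries for a persistent limit
# <domain> <type> <item>  <value>
*          soft   nofile  65535
*          hard   nofile  65535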

Further Steps for k8s