English | 简体中文
Paddle backend requires paddle inference API, so it is necessary to have paddle inference lib.
Use build_paddle.sh to build paddle inference lib and headers. This step may takes lots of time.
$ cd paddle-lib
$ bash build_paddle.sh
$ cd .. # back to root of paddle_backend
After paddle is successfully built, please check a directory called paddle
is under paddle-lib directory.
Build libtriton_paddle.so
by scripts/build_paddle_backend.sh
$ bash scripts/build_paddle_backend.sh
The model repository is the directory where you place the models that you want Triton to server. An example model repository is included in the examples. Before using the repository, you must fetch it by the following scripts.
$ cd examples
$ ./fetch_models.sh
$ cd .. # back to root of paddle_backend
Launch triton inference server with single GPU, you can change any docker related configurations in scripts/launch_triton_server.sh if necessary.
$ bash scripts/launch_triton_server.sh
Use Triton’s ready endpoint to verify that the server and the models are ready for inference. From the host system use curl to access the HTTP endpoint that indicates server status.
$ curl -v localhost:8000/v2/health/ready
...
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
Before running the examples, please make sure the triton server is running correctly.
Change working directory to examples and download the data
$ cd examples
$ ./fetch_perf_data.sh # download benchmark input
ERNIE-2.0 is a pre-training framework for language understanding.
Steps to run the benchmark on ERNIE
$ bash perf_ernie.sh
The ResNet50-v1.5 is a modified version of the original ResNet50 v1 model.
Steps to run the benchmark on ResNet50-v1.5
$ bash perf_resnet50_v1.5.sh
Steps to run the inference on ResNet50-v1.5.
-
Prepare processed images following DeepLearningExamples and place
imagenet
folder under examples directory. -
Run the inference
$ bash infer_resnet_v1.5.sh imagenet/<id>
Precision | Backend Accelerator | Client Batch Size | Sequences/second | P90 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Avg Latency (ms) |
---|---|---|---|---|---|---|---|
FP16 | TensorRT | 1 | 270.0 | 3.813 | 3.846 | 4.007 | 3.692 |
FP16 | TensorRT | 2 | 500.4 | 4.282 | 4.332 | 4.709 | 3.980 |
FP16 | TensorRT | 4 | 831.2 | 5.141 | 5.242 | 5.569 | 4.797 |
FP16 | TensorRT | 8 | 1128.0 | 7.788 | 7.949 | 8.255 | 7.089 |
FP16 | TensorRT | 16 | 1363.2 | 12.702 | 12.993 | 13.507 | 11.738 |
FP16 | TensorRT | 32 | 1529.6 | 22.495 | 22.817 | 24.634 | 20.901 |
Precision | Backend Accelerator | Client Batch Size | Sequences/second | P90 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Avg Latency (ms) |
---|---|---|---|---|---|---|---|
FP16 | TensorRT | 1 | 288.8 | 3.494 | 3.524 | 3.608 | 3.462 |
FP16 | TensorRT | 2 | 494.0 | 4.083 | 4.110 | 4.208 | 4.047 |
FP16 | TensorRT | 4 | 758.4 | 5.327 | 5.359 | 5.460 | 5.273 |
FP16 | TensorRT | 8 | 1044.8 | 7.728 | 7.770 | 7.949 | 7.658 |
FP16 | TensorRT | 16 | 1267.2 | 12.742 | 12.810 | 13.883 | 12.647 |
FP16 | TensorRT | 32 | 1113.6 | 28.840 | 29.044 | 30.357 | 28.641 |
FP16 | TensorRT | 64 | 1100.8 | 58.512 | 58.642 | 59.967 | 58.251 |
FP16 | TensorRT | 128 | 1049.6 | 121.371 | 121.834 | 123.371 | 119.991 |
Precision | Backend Accelerator | Client Batch Size | Sequences/second | P90 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Avg Latency (ms) |
---|---|---|---|---|---|---|---|
FP16 | TensorRT | 1 | 291.8 | 3.471 | 3.489 | 3.531 | 3.427 |
FP16 | TensorRT | 2 | 466.0 | 4.323 | 4.336 | 4.382 | 4.288 |
FP16 | TensorRT | 4 | 665.6 | 6.031 | 6.071 | 6.142 | 6.011 |
FP16 | TensorRT | 8 | 833.6 | 9.662 | 9.684 | 9.767 | 9.609 |
FP16 | TensorRT | 16 | 899.2 | 18.061 | 18.208 | 18.899 | 17.748 |
FP16 | TensorRT | 32 | 761.6 | 42.333 | 43.456 | 44.167 | 41.740 |
FP16 | TensorRT | 64 | 793.6 | 79.860 | 80.410 | 80.807 | 79.680 |
FP16 | TensorRT | 128 | 793.6 | 158.207 | 158.278 | 158.643 | 157.543 |