- TensorRT 7.2 provides new plugins for variable-length (var-len) BERT.
- TensorRT 8.0 supports the NVIDIA sparse tensor cores.

This repo demonstrates these features.
```shell
$ git clone https://github.com/zlsh80826/ERNIE-varlen-demo.git
$ cd ERNIE-varlen-demo
$ git checkout sparsity
```
- Download the models and data through the links.
- Download the TensorRT 8 GA release.
After downloading, extract `models.tar.xz` and `data.tar.xz` under this repo directory. (There is no need to extract `TensorRT.*.tar.gz`; it will be copied into the container and extracted there.)

```shell
tar xf models.tar.xz
tar xf data.tar.xz
```
Your directory layout should then look like:
```shell
$ tree .
.
├── benchmark-mig.sh
├── benchmark.sh
├── data
├── Dockerfile
├── models
├── README.md
├── scripts
│   ├── build.sh
│   ├── cbenchmark.py
│   ├── launch.sh
│   └── utils.py
├── src
│   ├── CMakeLists.txt
│   └── inference.cu
└── TensorRT-8.0.1.6.Linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz
```
The image builds PaddlePaddle and ERNIE-varlen-demo. The build may take 30-90 minutes depending on the CPU model. If your system has less than 32 GB of memory, modify the Dockerfile to compile Paddle with fewer threads (the default uses `nproc` threads).
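If you need to pick a lower thread count, a rough heuristic (an assumption of this sketch, not something the repo prescribes) is one compile thread per ~4 GB of RAM, capped at the core count; the resulting number can replace `nproc` in the Dockerfile:

```shell
# Heuristic: allow roughly one compile thread per 4 GB of RAM,
# capped at the number of CPU cores reported by nproc.
mem_gb=$(awk '/MemTotal/ {print int($2/1024/1024)}' /proc/meminfo)
threads=$(( mem_gb / 4 ))
[ "$threads" -lt 1 ] && threads=1
[ "$threads" -gt "$(nproc)" ] && threads=$(nproc)
echo "$threads"
```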
If you are going to benchmark with MIG, configure MIG before executing `launch.sh`.
```shell
$ bash scripts/build.sh
$ bash scripts/launch.sh
$ # enter the container
```
After entering the container, you can start the benchmark with:
```shell
$ bash benchmark.sh
```
If you are going to benchmark with MIG-ALL (run the benchmark simultaneously on all MIG instances), enable and configure MIG before entering the container. The MIG-ALL benchmark has two parts. The first part executes the normal benchmark and generates the serialized TensorRT engine file for the second part. The second part then reads the generated engine file and runs the benchmark on each MIG instance to simulate the performance with MIG enabled.
```shell
$ bash benchmark.sh
$ bash benchmark-mig.sh
```
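The per-instance launch in the second part can be sketched as follows. This is an illustrative sketch, not the repo's actual script: it parses MIG instance UUIDs out of `nvidia-smi -L` output and shows how one benchmark process could be pinned to each instance via `CUDA_VISIBLE_DEVICES` (the `./benchmark ...` command is a placeholder):

```python
import re

def parse_mig_uuids(nvidia_smi_l_output: str) -> list:
    """Extract MIG device UUIDs from `nvidia-smi -L` output.

    Each MIG device line ends with something like `(UUID: MIG-...)`;
    plain GPU lines use `GPU-...` UUIDs and are skipped.
    """
    return re.findall(r"\(UUID:\s*(MIG-[^)\s]+)\)", nvidia_smi_l_output)

if __name__ == "__main__":
    import shutil
    import subprocess
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True
        ).stdout
        for uuid in parse_mig_uuids(out):
            # Placeholder: the real benchmark binary and args live in this repo.
            print(f"CUDA_VISIBLE_DEVICES={uuid} ./benchmark ...")
```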
- The following results were obtained on an Intel(R) Xeon(R) Silver 4210R CPU in performance mode. The GPU clocks were reset to their defaults.

```shell
sudo cpupower frequency-set -g performance
sudo nvidia-smi -rac
sudo nvidia-smi -rgc
```
- The MIG setting on A100 split the GPU into 7 instances. The results in the following tables were obtained on one of the instances.

```shell
sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
```
- The MIG setting on A30 split the GPU into 4 instances. The results in the following tables were obtained on one of the instances.

```shell
sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 14,14,14,14 -C
```
Batch Size | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
---|---|---|---|---|---|---|---|---|---|
A10 | 184 | 333 | 522 | 833 | 1014 | 1399 | 1505 | 1618 | 1660 |
A30 | 228 | 392 | 675 | 932 | 1735 | 2061 | 2304 | 2454 | 2549 |
A30-MIG | 105 | 139 | 260 | 367 | 496 | 622 | 687 | 733 | 761 |
A30-MIG-ALL | 420 | 556 | 1024 | 1415 | 1839 | 2213 | 2394 | 2475 | 2549 |
A100 | 297 | 532 | 909 | 1311 | 2621 | 3346 | 3969 | 4245 | 4485 |
A100-MIG | 93 | 136 | 284 | 363 | 491 | 602 | 662 | 701 | 726 |

Batch Size | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
---|---|---|---|---|---|---|---|---|---|
A10 | 243 | 468 | 812 | 1267 | 1703 | 1885 | 2060 | 2136 | 2192 |
A30 | 266 | 476 | 883 | 1407 | 2174 | 2740 | 3061 | 3182 | 3379 |
A30-MIG | 134 | 233 | 410 | 578 | 756 | 863 | 932 | 993 | 1026 |
A30-MIG-ALL | 533 | 927 | 1621 | 2235 | 2834 | 3136 | 3362 | 3501 | 3603 |
A100 | 307 | 595 | 1061 | 1869 | 3161 | 4152 | 5383 | 5786 | 5973 |
A100-MIG | 119 | 215 | 384 | 572 | 745 | 844 | 933 | 987 | 1019 |

Batch Size | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
---|---|---|---|---|---|---|---|---|---|
A10 | 1.32 | 1.41 | 1.56 | 1.52 | 1.68 | 1.35 | 1.37 | 1.32 | 1.32 |
A30 | 1.17 | 1.21 | 1.54 | 1.51 | 1.25 | 1.33 | 1.33 | 1.30 | 1.33 |
A30-MIG | 1.58 | 1.68 | 1.58 | 1.58 | 1.52 | 1.39 | 1.36 | 1.35 | 1.35 |
A30-MIG-ALL | 1.27 | 1.67 | 1.58 | 1.58 | 1.54 | 1.41 | 1.40 | 1.41 | 1.41 |
A100 | 1.03 | 1.12 | 1.17 | 1.43 | 1.21 | 1.24 | 1.36 | 1.36 | 1.33 |
A100-MIG | 1.28 | 1.58 | 1.35 | 1.58 | 1.51 | 1.40 | 1.41 | 1.41 | 1.40 |
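The third table appears to be the element-wise ratio of the second table's throughput to the first (e.g. 243 / 184 ≈ 1.32 for A10 at batch size 1). A quick sketch that reproduces the A10 row, with the numbers copied from the tables above:

```python
# Per-batch-size throughput for A10, copied from the first two tables.
a10_table1 = [184, 333, 522, 833, 1014, 1399, 1505, 1618, 1660]
a10_table2 = [243, 468, 812, 1267, 1703, 1885, 2060, 2136, 2192]

# Element-wise ratio, rounded to two decimals as in the third table.
speedup = [round(b / a, 2) for a, b in zip(a10_table1, a10_table2)]
print(speedup)
# → [1.32, 1.41, 1.56, 1.52, 1.68, 1.35, 1.37, 1.32, 1.32]
```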