PaddlePaddle/ERNIE var-len demo

Features

This repo demonstrates PaddlePaddle ERNIE variable-length inference with TensorRT 8, covering both dense and sparse weights, plus benchmarking with MIG.

Quick Start

Clone the repository

$ git clone https://github.com/zlsh80826/ERNIE-varlen-demo.git
$ cd ERNIE-varlen-demo
$ git checkout sparsity

Download the models, data, and TensorRT 8

After downloading, extract models.tar.xz and data.tar.xz in this repo's directory. (There is no need to extract TensorRT.*.tar.gz; it will be copied into the container and extracted there.)

$ tar xf models.tar.xz
$ tar xf data.tar.xz

Your directory layout should then look like this:

$ tree .
.
├── benchmark-mig.sh
├── benchmark.sh
├── data
├── Dockerfile
├── models
├── README.md
├── scripts
│   ├── build.sh
│   ├── cbenchmark.py
│   ├── launch.sh
│   └── utils.py
├── src
│   ├── CMakeLists.txt
│   └── inference.cu
└── TensorRT-8.0.1.6.Linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz
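
Before building, it may help to confirm the layout above. The helper below is a small sketch, not part of this repo; `check_dirs` is a hypothetical name.

```shell
# Hypothetical sanity check: verify the extracted directories exist in the
# repo root before building the image. check_dirs is not part of this repo.
check_dirs() {
  for d in "$@"; do
    [ -d "$d" ] || { echo "missing: $d" >&2; return 1; }
  done
  echo "all present"
}
# From the repo root: check_dirs models data scripts src
```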

Build the image

The image builds PaddlePaddle and ERNIE-varlen-demo. The build may take 30-90 minutes depending on the CPU model. If your system has less than 32 GB of memory, modify the Dockerfile to compile Paddle with fewer threads (it defaults to nproc). If you plan to benchmark with MIG, configure MIG before executing launch.sh.

$ bash scripts/build.sh
$ bash scripts/launch.sh
$ # enter the container
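
If you need to lower the compile parallelism mentioned above, the edit might look like the sketch below. The Dockerfile line shown is an assumption for illustration; check what your actual Dockerfile passes to make before changing it.

```shell
# Hypothetical example: assume the Dockerfile compiles Paddle with
# "make -j$(nproc)". Capping it at a fixed -j8 lowers peak memory usage.
line='RUN make -j$(nproc) && make install'   # assumed Dockerfile line
capped=$(sed 's/-j$(nproc)/-j8/' <<< "$line")
echo "$capped"
```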

Benchmark

After entering the container, start the benchmark with:

$ bash benchmark.sh

Benchmark with MIG

If you are going to run the MIG-ALL benchmark (running the benchmark simultaneously on all MIG instances), enable and configure MIG before entering the container. The MIG-ALL benchmark has two parts. The first part executes the normal benchmark and generates the serialized TensorRT engine file for the second part. The second part then reads the generated engine file and runs the benchmark on each MIG instance to simulate performance with MIG enabled.

$ bash benchmark.sh
$ bash benchmark-mig.sh

Benchmark Results

Notes

  1. The following results were obtained on an Intel(R) Xeon(R) Silver 4210R CPU in performance mode. The GPU was left at its default clocks.

     sudo cpupower frequency-set -g performance
     sudo nvidia-smi -rac
     sudo nvidia-smi -rgc

  2. The MIG setting on A100 split the GPU into 7 instances. The results in the tables below were obtained on one of the instances.

     sudo nvidia-smi -mig 1
     sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

  3. The MIG setting on A30 split the GPU into 4 instances. The results in the tables below were obtained on one of the instances.

     sudo nvidia-smi -mig 1
     sudo nvidia-smi mig -cgi 14,14,14,14 -C

Sentences/second with dense weights

Batch Size     1    2     4     8    16    32    64   128   256
A10          184  333   522   833  1014  1399  1505  1618  1660
A30          228  392   675   932  1735  2061  2304  2454  2549
A30-MIG      105  139   260   367   496   622   687   733   761
A30-MIG-ALL  420  556  1024  1415  1839  2213  2394  2475  2549
A100         297  532   909  1311  2621  3346  3969  4245  4485
A100-MIG      93  136   284   363   491   602   662   701   726

Sentences/second with sparse weights

Batch Size     1    2     4     8    16    32    64   128   256
A10          243  468   812  1267  1703  1885  2060  2136  2192
A30          266  476   883  1407  2174  2740  3061  3182  3379
A30-MIG      134  233   410   578   756   863   932   993  1026
A30-MIG-ALL  533  927  1621  2235  2834  3136  3362  3501  3603
A100         307  595  1061  1869  3161  4152  5383  5786  5973
A100-MIG     119  215   384   572   745   844   933   987  1019

Sparse weight speedup (sparse / dense)

Batch Size      1     2     4     8    16    32    64   128   256
A10          1.32  1.41  1.56  1.52  1.68  1.35  1.37  1.32  1.32
A30          1.17  1.21  1.54  1.51  1.25  1.33  1.33  1.30  1.33
A30-MIG      1.58  1.68  1.58  1.58  1.52  1.39  1.36  1.35  1.35
A30-MIG-ALL  1.27  1.67  1.58  1.58  1.54  1.41  1.40  1.41  1.41
A100         1.03  1.12  1.17  1.43  1.21  1.24  1.36  1.36  1.33
A100-MIG     1.28  1.58  1.35  1.58  1.51  1.40  1.41  1.41  1.40
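
The speedup numbers are simply sparse throughput divided by dense throughput. As a sanity check, the A10 row can be recomputed from the two tables above (a sketch; the numbers are copied from the tables, not produced by this repo's scripts):

```shell
# Recompute the A10 speedup row: sparse sentences/sec over dense sentences/sec.
speedup=$(awk 'BEGIN {
  split("243 468 812 1267 1703 1885 2060 2136 2192", s, " ")  # A10 sparse row
  split("184 333 522 833 1014 1399 1505 1618 1660", d, " ")   # A10 dense row
  for (i = 1; i <= 9; i++) printf "%s%.2f", (i > 1 ? " " : ""), s[i] / d[i]
}')
echo "$speedup"   # 1.32 1.41 1.56 1.52 1.68 1.35 1.37 1.32 1.32
```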