/omp-benchmark-for-pytorch

Benchmark omp threshold for pytorch.


Pytorch element-wise operation optimization benchmark

1. Abstract

This repository provides a benchmark for evaluating the performance of element-wise operations on CPU.

Tested CPU:

| CPU Model | Sockets | Cores/Socket | Frequency |
|---|---|---|---|
| Intel(R) Xeon(R) CPU E5-2699 v4 | 2 | 22 | 2.20GHz |
| Intel(R) Xeon(R) Platinum 8180 CPU | 2 | 28 | 2.50GHz |
| Intel(R) Core(TM) i7-5960X CPU | 1 | 8 | 3.00GHz |

Tested operations:

copy add div sin exp sum prod

Conclusions:

  • The OpenMP threshold, which is set to 100K in official PyTorch, is too high for small and medium-sized contiguous tensors to benefit from OpenMP parallelism.
  • Operations on discontiguous tensors are sped up significantly by Intel PyTorch.
  • The optimal OpenMP threshold depends on the operation type and the CPU type.
    • The optimal threshold is lower for more complex operations.
    • The optimal threshold for discontiguous tensors is usually lower than for contiguous tensors.

Note:
OpenMP threshold: if the size of a tensor is larger than this value, the operation runs in parallel; otherwise it runs serially.
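
The dispatch rule described in this note can be sketched in a few lines. This is only an illustration in plain Python, not PyTorch's actual implementation (which lives in its C/C++ kernels); the names `OMP_THRESHOLD` and `runs_in_parallel` are hypothetical and introduced here for the example.

```python
# Hypothetical names for illustration only; not part of PyTorch.
OMP_THRESHOLD = 100_000  # the value this README reports for official PyTorch


def runs_in_parallel(num_elements, threshold=OMP_THRESHOLD):
    """Return True when an operation on num_elements elements takes the parallel path."""
    # Strictly larger than the threshold -> parallel; otherwise serial.
    return num_elements > threshold


for size in (1_000, 100_000, 100_001):
    path = "parallel" if runs_in_parallel(size) else "serial"
    print(f"{size:>7} elements -> {path}")
```

With the threshold lowered to 800, as in the tuning experiments later in this README, every tested size from 1K upward would take the parallel path.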

This benchmark also gives a rough estimate of the optimal OpenMP threshold for the copy, add, div, exp, sin, sum and prod operations on different types of CPU.

For contiguous tensor operation:

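| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
|---|---|---|---|
| copy | 80k | 20k | 8k |
| add | 80k | 20k | 8k |
| div | 50k | 10k | 2k |
| exp | 1k | 1k | 1k |
| sin | 1k | 1k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |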

For discontiguous tensor operation:

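| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
|---|---|---|---|
| copy | 20k | 8k | 2k |
| add | 20k | 8k | 2k |
| div | 10k | 8k | 1k |
| exp | 1k | 1k | 1k |
| sin | 2k | 2k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |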

2. Major work

  • Identifying the optimal OpenMP threshold to fully exploit CPU performance
    The OpenMP threshold of official PyTorch is set to 100K. However, benchmarks of the copy, add, div, exp and sin operations, in both the contiguous and discontiguous cases and on different CPU types, show that this value is too high. A rough estimate of the optimal OpenMP threshold is also proposed for those operations.
  • Parallelizing discontiguous tensor operations with OpenMP
    Slicing tensors is very common in scientific computing, and slicing generates discontiguous tensors. Official PyTorch does not currently parallelize operations on discontiguous tensors. Our main work tries to fill this gap. The code is available at dev-omp, and upstreaming is in progress.
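
How a slice ends up discontiguous can be illustrated with strides. The sketch below is plain Python with hypothetical helper names (`contiguous_strides`, `is_contiguous`). It uses a simplified check that compares a view's strides with dense row-major strides, and does not handle corner cases such as size-1 dimensions. In PyTorch itself the corresponding query is `Tensor.is_contiguous()`.

```python
def contiguous_strides(shape):
    """Row-major (C-order) strides, in elements, for a dense tensor of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """Simplified check: the strides match the dense row-major strides."""
    return tuple(strides) == contiguous_strides(shape)


# A dense 4 x 6 tensor: moving one row skips 6 elements, one column skips 1.
full_shape, full_strides = (4, 6), contiguous_strides((4, 6))

# Keeping every second column (like t[:, ::2]) gives shape 4 x 3 but reuses the
# parent's memory: rows are still 6 elements apart and columns are now 2 apart.
slice_shape, slice_strides = (4, 3), (6, 2)

print(is_contiguous(full_shape, full_strides))    # True
print(is_contiguous(slice_shape, slice_strides))  # False
```

Because the elements of the slice are not adjacent in memory, a kernel cannot simply walk one flat buffer, which is why such operations need separate handling to be parallelized.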

3. Installation and test

3.1 Installation

Official Pytorch

Please refer to the official PyTorch installation instructions.

Intel Pytorch

Download the Intel PyTorch source code:

git clone --recursive -b dev-omp2 https://github.com/intel/pytorch.git

Before installing, you should set the CMAKE_PREFIX_PATH.

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

Install Intel PyTorch:

python setup.py install

3.2 Test

python benchmark.py <CONTIGUITY> <OPERATION> [OUTPUT FILENAME] 

Positional arguments:
CONTIGUITY: contiguity of the operands, contiguous/discontiguous
OPERATION: operation to benchmark, copy/add/div/sin/exp/sum/prod

Optional arguments:
OUTPUT FILENAME: output file name, output.log by default
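
For readers who want to see the command-line shape spelled out, here is a minimal `argparse` sketch that accepts the same arguments. It is an assumption about how such an interface could be written, not the repository's actual `benchmark.py`; the function name `build_parser` and the attribute names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser matching: benchmark.py <CONTIGUITY> <OPERATION> [OUTPUT FILENAME]."""
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument(
        "contiguity",
        choices=["contiguous", "discontiguous"],
        help="contiguity of the operands",
    )
    parser.add_argument(
        "operation",
        choices=["copy", "add", "div", "sin", "exp", "sum", "prod"],
        help="operation to benchmark",
    )
    # Optional positional argument: falls back to output.log when omitted.
    parser.add_argument("output", nargs="?", default="output.log", help="output file name")
    return parser


args = build_parser().parse_args(["discontiguous", "add"])
print(args.contiguity, args.operation, args.output)  # discontiguous add output.log
```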

4. The benchmark result

4.1 Contiguous Tensor Operation OpenMP Threshold Tuning

The add and exp operations on contiguous tensors with sizes from 1K to 100K are listed here as test cases. We compiled two versions of official PyTorch with two different OpenMP thresholds. In one version the threshold is set to 100K so that all test cases run serially. In the other the threshold is set to 800 so that all test cases run in parallel.
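
The timings in the tables below are averages in microseconds. A harness along the following lines could produce such numbers. This is a sketch under assumptions, not the script used for this README: `time_op_us` is a hypothetical helper, and the workload is a plain Python list addition standing in for a tensor operation such as `torch.add`.

```python
import time


def time_op_us(op, repeats=1000, warmup=10):
    """Average wall-clock time of op() in microseconds over `repeats` calls."""
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        op()
    start = time.perf_counter()
    for _ in range(repeats):
        op()
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6


# Stand-in workload: element-wise add of two Python lists of 1k elements.
a = [1.0] * 1000
b = [2.0] * 1000
avg_us = time_op_us(lambda: [x + y for x, y in zip(a, b)])
print(f"average time: {avg_us:.2f} us")
```

Running the same harness once against each build (threshold 100K and threshold 800) would give the "in series" and "in parallel" columns for a given tensor size.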

Platform: Platinum 8180
Operation: add
Tensor Contiguity: contiguous
Unit: microsecond

Time cost result is below:

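| Tensor Size | In series (µs) | In parallel (µs) | Speed-up |
|---|---|---|---|
| 1k | 1.04 | 5.15 | 0.20X |
| 2k | 1.23 | 5.47 | 0.22X |
| 3k | 1.33 | 5.34 | 0.24X |
| 4k | 1.47 | 5.41 | 0.27X |
| 5k | 1.48 | 5.40 | 0.27X |
| 8k | 1.81 | 5.55 | 0.32X |
| 10k | 1.98 | 5.66 | 0.35X |
| 20k | 2.74 | 6.74 | 0.40X |
| 50k | 5.12 | 6.59 | 0.77X |
| 80k | 14.79 | 6.59 | 2.24X |
| 100k | 21.97 | 6.70 | 3.27X |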

Conclusion: A threshold of 80K works well for the add operation on contiguous tensors.

Platform: Platinum 8180
Operation: exp
Tensor Contiguity: contiguous
Unit: microsecond

Time cost result is below:

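| Tensor Size | In series (µs) | In parallel (µs) | Speed-up |
|---|---|---|---|
| 1k | 9.48 | 5.66 | 1.67X |
| 2k | 17.00 | 6.35 | 2.67X |
| 3k | 24.82 | 6.03 | 4.11X |
| 4k | 32.52 | 6.28 | 5.17X |
| 5k | 40.33 | 6.27 | 6.42X |
| 8k | 63.58 | 7.04 | 9.02X |
| 10k | 79.13 | 7.61 | 10.38X |
| 20k | 156.78 | 9.11 | 17.20X |
| 50k | 387.85 | 15.07 | 25.73X |
| 80k | 623.34 | 20.23 | 30.80X |
| 100k | 779.95 | 23.57 | 33.08X |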

Conclusion: A threshold of 1K works well for the exp operation on contiguous tensors.

The results above show that:

  • Different operations have different optimal OpenMP thresholds, and 100K is not a suitable value.
  • The optimal OpenMP threshold is lower for more complex operations.

We do not list the detailed data for the div, sin, sum and prod operations; the tables in the Abstract give a rough estimate of the optimal OpenMP threshold for each operation.
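
One simple way to read a threshold off such a table is to take the smallest tested size at which the parallel run is at least as fast as the serial run. The sketch below applies that rule to the Platinum 8180 contiguous add and exp timings listed above. The helper `estimate_threshold` is hypothetical, and the rule is an assumption about how the conclusions could be derived, not necessarily the authors' procedure.

```python
def estimate_threshold(timings):
    """Smallest size where parallel time <= serial time, or None if serial always wins.

    timings: list of (size, serial_us, parallel_us) tuples sorted by size.
    """
    for size, serial_us, parallel_us in timings:
        if serial_us >= parallel_us:
            return size
    return None


# (size, in series, in parallel), microseconds, Platinum 8180, contiguous tensors.
add_contiguous = [
    (1_000, 1.04, 5.15), (2_000, 1.23, 5.47), (3_000, 1.33, 5.34),
    (4_000, 1.47, 5.41), (5_000, 1.48, 5.40), (8_000, 1.81, 5.55),
    (10_000, 1.98, 5.66), (20_000, 2.74, 6.74), (50_000, 5.12, 6.59),
    (80_000, 14.79, 6.59), (100_000, 21.97, 6.70),
]
exp_contiguous = [
    (1_000, 9.48, 5.66), (2_000, 17.00, 6.35), (3_000, 24.82, 6.03),
    (4_000, 32.52, 6.28), (5_000, 40.33, 6.27), (8_000, 63.58, 7.04),
    (10_000, 79.13, 7.61), (20_000, 156.78, 9.11), (50_000, 387.85, 15.07),
    (80_000, 623.34, 20.23), (100_000, 779.95, 23.57),
]

print(estimate_threshold(add_contiguous))  # 80000
print(estimate_threshold(exp_contiguous))  # 1000
```

Since 1K is the smallest size tested, the value returned for exp is only an upper bound on the true crossover point.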

4.2 Discontiguous tensor operation parallelization

The performance of the add and exp operations on discontiguous tensors with sizes from 1K to 180K is listed below. Official PyTorch does not parallelize operations on discontiguous tensors with OpenMP, but the Intel version does. To show that OpenMP also benefits discontiguous tensor operations, and to find an optimal OpenMP threshold, we compiled two versions of PyTorch. One is official PyTorch. The other is Intel PyTorch with the OpenMP threshold set to 800 so that all test cases run in parallel.

Platform: Platinum 8180
Operation: add
Tensor Contiguity: discontiguous
Unit: microsecond

Time cost result is below:

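| Tensor Size | In series (µs) | In parallel (µs) | Speed-up |
|---|---|---|---|
| 1k | 1.69 | 6.98 | 0.24X |
| 2k | 2.42 | 7.47 | 0.32X |
| 3k | 3.12 | 7.38 | 0.42X |
| 4k | 3.77 | 7.43 | 0.50X |
| 5k | 4.46 | 7.47 | 0.59X |
| 8k | 6.44 | 7.49 | 0.85X |
| 10k | 7.82 | 7.69 | 1.01X |
| 20k | 14.54 | 7.80 | 1.86X |
| 50k | 34.35 | 8.31 | 4.13X |
| 80k | 54.80 | 8.68 | 6.31X |
| 100k | 68.82 | 9.07 | 7.58X |
| 110k | 75.92 | 8.99 | 8.43X |
| 120k | 83.03 | 9.52 | 8.71X |
| 150k | 104.24 | 9.92 | 10.50X |
| 180k | 124.28 | 10.68 | 11.62X |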

Conclusion: A threshold of 10K works well for the add operation on discontiguous tensors.

Platform: Platinum 8180
Operation: exp
Tensor Contiguity: discontiguous
Unit: microsecond

Time cost result is below:

| Tensor Size | In series (µs) | In parallel (µs) | Speed-up |
|---|---|---|---|
| 1k | 10.02 | 7.27 | 1.37X |
| 2k | 19.01 | 7.83 | 2.42X |
| 3k | 27.73 | 7.48 | 3.70X |
| 4k | 36.45 | 7.66 | 4.75X |
| 5k | 45.26 | 8.13 | 5.56X |
| 8k | 71.36 | 8.70 | 8.19X |
| 10k | 88.75 | 9.15 | 9.69X |
| 20k | 176.26 | 11.32 | 15.56X |
| 50k | 439.68 | 19.07 | 23.04X |
| 80k | 700.40 | 26.99 | 25.94X |
| 100k | 876.42 | 27.61 | 31.73X |
| 110k | 983.76 | 29.79 | 33.01X |
| 120k | 1050.07 | 31.87 | 32.94X |
| 150k | 1341.23 | 37.59 | 35.67X |
| 180k | 1584.88 | 43.27 | 36.62X |

Conclusion: A threshold of 1K works well for the exp operation on discontiguous tensors.

Conclusions:

  • Discontiguous tensor operations can be sped up considerably with OpenMP.
  • The optimal OpenMP threshold for discontiguous tensors is usually lower than for contiguous tensors, because the same operation takes longer on a discontiguous tensor than on a contiguous one.

4.3 LSTM benchmark test

To confirm the performance gain from the element-wise optimization, we chose a widely used RNN unit, LSTM, as the model-level benchmark. This is because:

  1. LSTM related computations involve considerable elementwise operations;
  2. PyTorch provides a scalable and flexible Python API to execute LSTM computation.

We run the LSTM benchmark with the script at https://github.com/xhzhao/pytorch-rnn-benchmark, in which:

  1. The Python API torch.nn.LSTM is used as the entry point for LSTM computation.
  2. The benchmarks run on 24 selected input shapes used by different NLP models.
  3. The unit for the benchmarks is sentences per second (SPS). [N, T, D, Z] stands for batch size, sentence length, embedding size and hidden size. Specifically, [64, 50, 500, 500] is used by OpenNMT and [64, 25, 4096, 4096] is used by DeepBench.
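
The throughput unit and the speed-up column can be made concrete with a short calculation. The sketch below assumes SPS is counted as batch size times number of iterations divided by elapsed time, and that the speed-up is the optimized SPS divided by the out-of-box (OOB) SPS. Both helper names are hypothetical and are not taken from the benchmark script.

```python
def sentences_per_second(batch_size, iterations, elapsed_seconds):
    """Throughput in sentences per second (assumed definition, see lead-in)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return batch_size * iterations / elapsed_seconds


def speedup(optimized_sps, baseline_sps):
    """Ratio of optimized throughput to baseline (OOB) throughput."""
    return optimized_sps / baseline_sps


# 100 iterations with batch size N = 64 that take 8 seconds in total.
print(sentences_per_second(64, 100, 8.0))  # 800.0

# First inference row for Platinum 8180 in this README: 899.4494 -> 7393.76 SPS.
print(f"{speedup(7393.76, 899.4494):.2f}X")  # 8.22X
```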

Platform: Platinum-8180
Phase: Inference
Unit: SPS (sentences per second)

| LSTM Input Shape | Xeon Platinum 8180 OOB | Xeon Platinum 8180 Optimized | Speed-up |
|---|---|---|---|
| [64, 15, 500, 500] | 899.4494 | 7393.76 | 8.22X |
| [64, 20, 500, 500] | 937.1688 | 5895.53 | 6.29X |
| [64, 25, 500, 500] | 750.8159 | 4808.17 | 6.40X |
| [64, 30, 500, 500] | 625.825 | 2351.56 | 3.76X |
| [64, 35, 500, 500] | 536.1393 | 3446.69 | 6.43X |
| [64, 40, 500, 500] | 469.1356 | 2907.74 | 6.20X |
| [64, 45, 500, 500] | 417.338 | 2502.57 | 6.00X |
| [64, 50, 500, 500] | 375.6814 | 2412.96 | 6.43X |
| [16, 25, 512, 512] | 474.9601 | 1325.45 | 2.79X |
| [32, 25, 512, 512] | 606.5853 | 2394.69 | 3.95X |
| [64, 25, 512, 512] | 700.1314 | 3661.21 | 5.23X |
| [128, 25, 512, 512] | 771.5298 | 4931.85 | 6.39X |
| [16, 25, 1024, 1024] | 195.6518 | 434.34 | 2.22X |
| [32, 25, 1024, 1024] | 261.1828 | 792.48 | 3.03X |
| [64, 25, 1024, 1024] | 323.7316 | 1174.23 | 3.62X |
| [128, 25, 1024, 1024] | 458.3642 | 1793.54 | 3.91X |
| [16, 25, 2048, 2048] | 48.7229 | 71.07 | 1.46X |
| [32, 25, 2048, 2048] | 77.4796 | 131.74 | 1.70X |
| [64, 25, 2048, 2048] | 132.8328 | 245.78 | 1.85X |
| [128, 25, 2048, 2048] | 178.2548 | 429.59 | 2.41X |
| [16, 25, 4096, 4096] | 12.4995 | 16.99 | 1.36X |
| [32, 25, 4096, 4096] | 23.0582 | 28.89 | 1.25X |
| [64, 25, 4096, 4096] | 39.3725 | 53.48 | 1.36X |
| [128, 25, 4096, 4096] | 61.866 | 97.97 | 1.58X |

Platform: Platinum-8180
Phase: Training
Unit: SPS (sentences per second)

| LSTM Input Shape | Xeon Platinum 8180 OOB | Xeon Platinum 8180 Optimized | Speed-up |
|---|---|---|---|
| [64, 15, 500, 500] | 432.5038 | 740.19 | 1.71X |
| [64, 20, 500, 500] | 385.2532 | 506.49 | 1.31X |
| [64, 25, 500, 500] | 308.066 | 476.33 | 1.55X |
| [64, 30, 500, 500] | 264.2467 | 406.49 | 1.54X |
| [64, 35, 500, 500] | 217.2079 | 362.4 | 1.67X |
| [64, 40, 500, 500] | 199.5474 | 321.25 | 1.61X |
| [64, 45, 500, 500] | 187.0923 | 292.01 | 1.56X |
| [64, 50, 500, 500] | 159.5678 | 255.32 | 1.60X |
| [16, 25, 512, 512] | 168.2578 | 269.11 | 1.60X |
| [32, 25, 512, 512] | 217.3134 | 365.27 | 1.68X |
| [64, 25, 512, 512] | 273.1848 | 475.26 | 1.74X |
| [128, 25, 512, 512] | 320.5748 | 549.36 | 1.71X |
| [16, 25, 1024, 1024] | 62.4692 | 89.46 | 1.43X |
| [32, 25, 1024, 1024] | 89.6243 | 144.03 | 1.61X |
| [64, 25, 1024, 1024] | 127.414 | 199.49 | 1.57X |
| [128, 25, 1024, 1024] | 174.6576 | 255.07 | 1.46X |
| [16, 25, 2048, 2048] | 18.8309 | 25.69 | 1.36X |
| [32, 25, 2048, 2048] | 30.9957 | 47.01 | 1.52X |
| [64, 25, 2048, 2048] | 51.2821 | 75.98 | 1.48X |
| [128, 25, 2048, 2048] | 71.7206 | 113.27 | 1.58X |
| [16, 25, 4096, 4096] | 6.0788 | 7.46 | 1.23X |
| [32, 25, 4096, 4096] | 10.954 | 13.98 | 1.28X |
| [64, 25, 4096, 4096] | 18.5955 | 24.85 | 1.34X |
| [128, 25, 4096, 4096] | 28.1366 | 39.01 | 1.39X |

Platform: CPU E5-2699 v4
Phase: Inference
Unit: SPS (sentences per second)

| LSTM Input Shape | Xeon E5-2699 OOB | Xeon E5-2699 Optimized | Speed-up |
|---|---|---|---|
| [64, 15, 500, 500] | 1169.737 | 7149.82 | 6.11X |
| [64, 20, 500, 500] | 923.5499 | 6033.54 | 6.53X |
| [64, 25, 500, 500] | 739.8101 | 4846.39 | 6.55X |
| [64, 30, 500, 500] | 618.0939 | 4027.08 | 6.52X |
| [64, 35, 500, 500] | 528.3323 | 3401.53 | 6.44X |
| [64, 40, 500, 500] | 462.2187 | 2972.32 | 6.43X |
| [64, 45, 500, 500] | 410.5386 | 2625.95 | 6.40X |
| [64, 50, 500, 500] | 369.9179 | 2372.84 | 6.41X |
| [16, 25, 512, 512] | 639.4213 | 2172.63 | 3.40X |
| [32, 25, 512, 512] | 680.3161 | 3561.47 | 5.24X |
| [64, 25, 512, 512] | 727.8996 | 4864.45 | 6.68X |
| [128, 25, 512, 512] | 760.9095 | 5754.56 | 7.56X |
| [16, 25, 1024, 1024] | 320.0169 | 1381.03 | 4.32X |
| [32, 25, 1024, 1024] | 349.7738 | 1916.54 | 5.48X |
| [64, 25, 1024, 1024] | 368.3568 | 2265 | 6.15X |
| [128, 25, 1024, 1024] | 490.1187 | 2518.24 | 5.14X |
| [16, 25, 2048, 2048] | 137.989 | 383.87 | 2.78X |
| [32, 25, 2048, 2048] | 159.1569 | 590.48 | 3.71X |
| [64, 25, 2048, 2048] | 214.677 | 720.81 | 3.36X |
| [128, 25, 2048, 2048] | 210.0029 | 683.88 | 3.26X |
| [16, 25, 4096, 4096] | 42.7353 | 70.06 | 1.64X |
| [32, 25, 4096, 4096] | 66.9777 | 126.43 | 1.89X |
| [64, 25, 4096, 4096] | 82.5284 | 180.12 | 2.18X |
| [128, 25, 4096, 4096] | 83.1054 | 180.03 | 2.17X |

Platform: CPU E5-2699 v4
Phase: Training
Unit: SPS (sentences per second)

| LSTM Input Shape | Xeon E5-2699 OOB | Xeon E5-2699 Optimized | Speed-up |
|---|---|---|---|
| [64, 15, 500, 500] | 451.2899 | 627.66 | 1.39X |
| [64, 20, 500, 500] | 370.242 | 497.26 | 1.34X |
| [64, 25, 500, 500] | 298.1386 | 363.61 | 1.22X |
| [64, 30, 500, 500] | 251.8914 | 327.72 | 1.30X |
| [64, 35, 500, 500] | 225.749 | 285.99 | 1.27X |
| [64, 40, 500, 500] | 192.7014 | 271.03 | 1.41X |
| [64, 45, 500, 500] | 175.5287 | 245.5 | 1.40X |
| [64, 50, 500, 500] | 161.343 | 229.74 | 1.42X |
| [16, 25, 512, 512] | 207.6788 | 201.7 | 0.97X |
| [32, 25, 512, 512] | 250.4016 | 301.76 | 1.21X |
| [64, 25, 512, 512] | 306.2745 | 429.34 | 1.40X |
| [128, 25, 512, 512] | 345.1608 | 456.06 | 1.32X |
| [16, 25, 1024, 1024] | 66.2632 | 67.93 | 1.03X |
| [32, 25, 1024, 1024] | 37.8289 | 114.71 | 3.03X |
| [64, 25, 1024, 1024] | 76.6716 | 173.85 | 2.27X |
| [128, 25, 1024, 1024] | 141.6185 | 218 | 1.54X |
| [16, 25, 2048, 2048] | 20.5789 | 20.82 | 1.01X |
| [32, 25, 2048, 2048] | 34.5047 | 36.93 | 1.07X |
| [64, 25, 2048, 2048] | 55.1509 | 62.73 | 1.14X |
| [128, 25, 2048, 2048] | 71.7717 | 88.76 | 1.24X |
| [16, 25, 4096, 4096] | 6.8679 | 7.09 | 1.03X |
| [32, 25, 4096, 4096] | 12.5718 | 13.85 | 1.10X |
| [64, 25, 4096, 4096] | 20.1554 | 23.66 | 1.17X |
| [128, 25, 4096, 4096] | 27.4074 | 34.49 | 1.26X |

Conclusion:

According to the benchmarks run on Intel Xeon platforms:

On Platinum 8180:

  1. For LSTM inference (forward only), performance improves by 1.25X to 8.22X.
  2. For LSTM training (forward + backward), performance improves by 1.23X to 1.74X.

On E5-2699 v4:

  1. For LSTM inference (forward only), performance improves by 1.64X to 7.56X.
  2. For LSTM training (forward + backward), the speed-up ranges from 0.97X to 3.03X.

Test results analysis:

  1. For inference benchmarks: the share of element-wise operations in the total computation varies with the input shape, so the speed-up is expected to vary with the input shape as well.
  2. For training benchmarks: the same reason applies. In addition, the backward computation gains less from the element-wise optimization, so the speed-up on training benchmarks is expected to be smaller than on inference benchmarks, and likewise to vary with the input shape.