This is a collection of GPU micro-benchmarks. Each benchmark targets a particular scenario or hardware mechanism. Some of the benchmarks have been used to produce data for these papers:

- "Analytical performance estimation during code generation on modern GPUs"
- "Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs"
Benchmarks called `gpu-<benchmarkname>` are hipifyable! Whereas the default Makefile target builds the CUDA executable `cuda-<benchmarkname>`, the target `make hip-<benchmarkname>` uses the hipify-perl tool to create a file `main.hip` from the `main.cu` file and builds it with the hip compiler. The CUDA main files are written so that the hipify tool works without further intervention.
Also have a look at the gpu-metrics functions, which provide a concise way of measuring hardware performance counter metrics for a kernel launch from inside the running program.
If any of this is useful, stars and citations are welcome!
Measures the bandwidth of streaming kernels at varying occupancy. A shared memory allocation serves as a spoiler, so that only two thread blocks can run per SM (see the sketch below the kernel table). Varying the thread block size from 32 to 1024 then scans the occupancy from 3% to 100%.
Kernel | Formula | Streams |
---|---|---|
init | A[i] = c | 1 store stream |
read | sum = A[i] | 1 load stream |
scale | A[i] = B[i] * c | 1 load stream, 1 store stream |
triad | A[i] = B[i] + c * C[i] | 2 load streams, 1 store stream |
3pt | A[i] = B[i-1] + B[i] + B[i+1] | 1 load stream, 1 store stream |
5pt | A[i] = B[i-2] + B[i-1] + B[i] + B[i+1] + B[i+2] | 1 load stream, 1 store stream |
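A minimal sketch of the scale kernel with such a shared-memory spoiler (an assumed form, not the repo's exact code; the spoiler size would be tuned per architecture):

```cuda
// Sketch: scale kernel with a dummy shared-memory allocation ("spoiler")
// sized so that at most two thread blocks fit on one SM.
__global__ void scale(double *A, const double *B, double c, size_t N) {
  __shared__ double spoiler[3072];  // 24 kB shared memory per block (assumed size)
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < N) A[i] = B[i] * c;
  if (threadIdx.x >= 1024)          // never true, since blockDim.x <= 1024
    A[0] = spoiler[threadIdx.x % 3072];  // keeps the spoiler from being optimized out
}
```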
Results from an NVIDIA H100 PCIe / CUDA 11.7:
```
blockSize  threads    %occ   |        init  read  scale  triad   3pt   5pt
       32     3648    3 %    | GB/s:   228    96    183    254   168   164
       64     7296    6.2 %  | GB/s:   452   189    341    459   316   310
       96    10944    9.4 %  | GB/s:   676   277    472    635   443   436
      128    14592   12.5 %  | GB/s:   888   368    607    821   567   558
      160    18240   15.6 %  | GB/s:  1093   449    704    966   680   670
      192    21888   18.8 %  | GB/s:  1301   533    817   1121   794   781
      224    25536   21.9 %  | GB/s:  1495   612    925   1264   903   889
      256    29184   25.0 %  | GB/s:  1686   702   1037   1399  1005   989
      288    32832   28.1 %  | GB/s:  1832   764   1124   1487  1100  1082
      320    36480   31.2 %  | GB/s:  2015   841   1213   1564  1188  1169
      352    40128   34.4 %  | GB/s:  2016   908   1295   1615  1269  1250
      384    43776   37.5 %  | GB/s:  2016   985   1378   1644  1348  1326
      416    47424   40.6 %  | GB/s:  2016  1045   1439   1641  1415  1395
      448    51072   43.8 %  | GB/s:  2016  1116   1497   1649  1472  1453
      480    54720   46.9 %  | GB/s:  2016  1179   1544   1655  1521  1505
      512    58368   50.0 %  | GB/s:  2017  1261   1583   1675  1556  1545
      544    62016   53.1 %  | GB/s:  2016  1300   1591   1669  1572  1563
      576    65664   56.2 %  | GB/s:  2016  1362   1607   1678  1587  1579
      608    69312   59.4 %  | GB/s:  2018  1416   1619   1689  1598  1592
      640    72960   62.5 %  | GB/s:  2016  1473   1639   1712  1613  1607
      672    76608   65.6 %  | GB/s:  2016  1527   1638   1714  1618  1613
      704    80256   68.8 %  | GB/s:  2015  1578   1644   1725  1625  1619
      736    83904   71.9 %  | GB/s:  2016  1624   1651   1738  1632  1628
      768    87552   75.0 %  | GB/s:  2016  1680   1666   1755  1642  1638
      800    91200   78.1 %  | GB/s:  2015  1714   1663   1758  1645  1642
      832    94848   81.2 %  | GB/s:  2016  1759   1668   1770  1649  1647
      864    98496   84.4 %  | GB/s:  2016  1795   1673   1779  1654  1651
      896   102144   87.5 %  | GB/s:  2016  1837   1686   1796  1663  1662
      928   105792   90.6 %  | GB/s:  2018  1871   1684   1800  1666  1664
      960   109440   93.8 %  | GB/s:  2016  1897   1688   1808  1672  1670
      992   113088   96.9 %  | GB/s:  2016  1919   1693   1818  1678  1675
     1024   116736  100.0 %  | GB/s:  2016  1942   1704   1832  1686  1683
```
Results for the SCALE kernel on a selection of GPUs:
Note that the H100 results are for the PCIe version, which has lower DRAM bandwidth than the SXM version!
Pointer chasing benchmark for latency measurement. A single warp fully traverses a buffer in random order. A partitioning scheme is used to ensure that all cache lines are hit exactly once before they are accessed again. Latency in clock cycles is computed with the current clock rate.
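A minimal sketch of the traversal loop, assuming `buf` holds a precomputed random permutation where `buf[i]` stores the index of the next element to visit:

```cuda
// Sketch (assumed form): every load depends on the previous one, so the
// kernel runtime divided by the step count yields the access latency.
__global__ void chase(const int *buf, size_t steps, int *sink) {
  int idx = 0;
  for (size_t s = 0; s < steps; s++)
    idx = buf[idx];  // serialized, latency-bound dependent loads
  *sink = idx;       // prevents the compiler from removing the chain
}
```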
Sharp L1 cache transitions at 128/192/256 kB for NVIDIA's V100/A100/H100 and at 16 kB for AMD's MI210. V100 and MI210 both have a 6 MB L2 cache. The A100 and H100 have segmented L2 caches of 2×20 MB and 2×25 MB, respectively, which manifests as a small intermediate plateau when data is fetched from the far L2 section.
The RDNA2 GPU, the RX6900XT, has the most interesting cache hierarchy; its four cache levels are clearly visible: the 16 kB L0 cache, the 128 kB semi-shared L1 cache, the 4 MB L2 cache, and the 128 MB Infinity Cache. It is also the highest-clocking GPU, so its absolute access times are lower than those of the other GPUs. Measuring its DRAM latency is difficult, because the DRAM interface does not clock up for a single wavefront, resulting in DRAM latencies > 2000 cycles.
Measures bandwidths of the first and second cache level. Launches one thread block per SM. Each thread block repeatedly reads the contents of the same buffer. Varying the buffer size changes the targeted cache level.
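A minimal sketch of the measurement kernel, under the assumption that each block simply sums the buffer repeatedly:

```cuda
// Sketch (assumed form): launched with one thread block per SM; the buffer
// size selects whether reads are served from L1 or L2 after warm-up.
__global__ void cacheRead(const double *B, double *sink, int elems, int iters) {
  double sum = 0.0;
  for (int it = 0; it < iters; it++)
    for (int i = threadIdx.x; i < elems; i += blockDim.x)
      sum += B[i];
  if (sum == 123.456)  // never-true guard to defeat dead-code elimination
    *sink = sum;
}
```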
The 16 kB (MI100/MI210), 128 kB (V100), 192 kB (A100) and 256 kB (H100) L1 cache capacities are very pronounced and sharp. All three NVIDIA architectures transfer close to 128 B/cycle/SM, while the maximum measured value on AMD's MI100 and MI210 depends on the data type: for double precision it is 32 B/cycle/CU; for single precision and 16 B data types (either float4 or double2) the bandwidth reaches up to 64 B/cycle/CU.
This benchmark does not target the memory hierarchy levels past the second cache level (i.e. DRAM for most GPUs), because the data sets do not cleanly drop out of a shared cache: since all thread blocks read the same data, there is a lot of reuse potential inside a shared cache before the data is evicted. The RX6900XT values are bonkers past its 128 kB semi-shared L1 cache. A100 and H100 drop slightly at 20/25 MB, when the capacity of a single L2 section is exceeded. Beyond this point, data cannot be replicated in both L2 sections and the maximum bandwidth drops, as data also has to be fetched from the other section.
Measures bandwidths of shared cache levels. This benchmark explicitly does not target the L1 caches.
All three GPUs have similar L2 cache bandwidths of around 5 TB/s, though with different capacities.
A remarkable observation is the RX6900XT, which has a second shared cache level, the 128 MB Infinity Cache. At almost 1.92 TB/s, it is as fast as the A100's DRAM.

At the very beginning, the RX6900XT's semi-shared L1 cache is visible, where for some block placements the four L1 caches have a small effect. The same applies to the H100, whose larger L1 cache increases the chance that a thread block finds the data it wants to work on already loaded into the L1 cache by the previous thread block. This only works for small data sets, where there are only a few different data blocks and this chance is still significant. It is not attributable to the Distributed Shared Memory Network, which allows loads from other SMs' shared memory: that mechanism only works for explicit shared memory loads, not global loads, as it would require tag checking every L1 cache in the GPC for any load.
Read-only L1 cache benchmark that accesses memory with strides from 1 to 128. The bandwidth is converted to bytes per cycle and SM. The strides from 1 to 128 are formatted as a 16x8 table, because that highlights the recurring patterns at multiples of 2/4/8/16.
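A minimal sketch of the access pattern, assuming element-wise strides on a double buffer:

```cuda
// Sketch (assumed form): consecutive threads touch addresses STRIDE
// elements (8*STRIDE bytes) apart; one kernel instance per stride.
template <int STRIDE>
__global__ void strideRead(const double *B, double *sink, size_t N, int iters) {
  double sum = 0.0;
  for (int it = 0; it < iters; it++)
    for (size_t i = (size_t)threadIdx.x * STRIDE; i < N;
         i += (size_t)blockDim.x * STRIDE)
      sum += B[i];
  if (sum == 123.456) *sink = sum;  // never-true guard against dead-code elimination
}
```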
These multiples are important for NVIDIA's architectures, which clearly have their L1 cache structured as 16 banks of 8 B each. For strides that are a multiple of 16, every single thread accesses data from the same cache bank. The rate of address translation is also reduced when addresses no longer fall into the same 128 B cache line.
AMD's MI210 appears to have even more banks, with especially stark slowdowns to less than 4B/cycle for multiples of 32.
Testing the stencil-like, 2D structured grid access with different thread block shapes reveals differences in the L1 cache throughput:
(see the generated machine code of MI210 and A100 here: https://godbolt.org/z/1PvWqs9Kf)
AMD's MI210 is fine (at its much lower level) as long as blocks of at least 4 consecutive threads access contiguous data. NVIDIA's GPUs only reach their maximum throughput for thread block shapes that are at least 16 threads wide.
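A minimal sketch of such an access, where only the launch-time block shape changes between configurations (assumed form):

```cuda
// Sketch: 2D structured-grid access; launching with dim3(16,16) vs. e.g.
// dim3(4,64) changes how many consecutive addresses each warp covers.
__global__ void grid2D(double *A, const double *B, int width) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  A[(size_t)y * width + x] = B[(size_t)y * width + x];
}
```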
Along with the L1 cache size increase, both Ampere and Hopper also slightly improve the rate of L1 cache address lookups.
This program scans a range of computational intensities by varying the number of inner loop trips. It is suitable both for studying the transition from memory- to compute-bound code and for observing power consumption, clock frequencies and temperatures when using multiple GPUs. The shell script series.sh builds an executable for each value and runs them one after another once all builds have finished.
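A minimal sketch of such a kernel, assuming one load stream, one store stream and a compile-time trip count ITS (a hypothetical macro, e.g. passed per build as -DITS=8); with doubles this yields 2·(ITS+1) flops per 16 bytes, matching the 0.125/1.125/8.125 Fl/B values below for 0/8/64 trips:

```cuda
#ifndef ITS
#define ITS 0  // inner loop trips; one executable is built per value
#endif

__global__ void scaleCI(double *A, const double *B, double c, size_t N) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i >= N) return;
  double v = B[i];
#pragma unroll
  for (int it = 0; it < ITS; it++)
    v = v * c + 0.13;   // one additional FMA per loop trip
  A[i] = v * c + 0.13;  // 1 load + 1 store + 1 base FMA per element
}
```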
The code runs simultaneously on all available devices. Example output on four Tesla V100 PCIe 16GB:
```
1  640 blocks   0 its  0.125 Fl/B  869 GB/s   109 GF/s  1380 Mhz  138 W  60°C
2  640 blocks   0 its  0.125 Fl/B  869 GB/s   109 GF/s  1380 Mhz  137 W  59°C
3  640 blocks   0 its  0.125 Fl/B  869 GB/s   109 GF/s  1380 Mhz  124 W  56°C
0  640 blocks   0 its  0.125 Fl/B  869 GB/s   109 GF/s  1380 Mhz  124 W  54°C
1  640 blocks   8 its  1.125 Fl/B  861 GB/s   968 GF/s  1380 Mhz  159 W  63°C
0  640 blocks   8 its  1.125 Fl/B  861 GB/s   968 GF/s  1380 Mhz  142 W  56°C
2  640 blocks   8 its  1.125 Fl/B  861 GB/s   968 GF/s  1380 Mhz  157 W  62°C
3  640 blocks   8 its  1.125 Fl/B  861 GB/s   968 GF/s  1380 Mhz  144 W  59°C
[...]
0  640 blocks  64 its  8.125 Fl/B  811 GB/s  6587 GF/s  1380 Mhz  223 W  63°C
3  640 blocks  64 its  8.125 Fl/B  813 GB/s  6604 GF/s  1380 Mhz  230 W  66°C
1  640 blocks  64 its  8.125 Fl/B  812 GB/s  6595 GF/s  1380 Mhz  241 W  71°C
2  640 blocks  64 its  8.125 Fl/B  813 GB/s  6603 GF/s  1380 Mhz  243 W  69°C
```
Measures the host-to-device transfer rate of the cudaMemcpy function over a range of transfer sizes.
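A minimal host-side sketch of one measurement point (assumed form; pinned host memory via cudaMallocHost is needed to reach the full PCIe rate):

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
  const size_t bytes = 16 << 20;  // one example transfer size: 16 MB
  char *host, *dev;
  cudaMallocHost(&host, bytes);   // pinned host allocation
  cudaMalloc(&dev, bytes);
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // warm-up
  auto t0 = std::chrono::steady_clock::now();
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // blocking copy
  double t = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - t0).count();
  printf("%zukB %.2fms %.2fGB/s\n", bytes / 1024, t * 1e3, bytes / t * 1e-9);
  cudaFreeHost(host);
  cudaFree(dev);
  return 0;
}
```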
Example output for a Tesla V100 PCIe 16GB:

```
      1kB   0.03ms   0.03GB/s  0.68%
      2kB   0.03ms   0.06GB/s  5.69%
      4kB   0.03ms   0.12GB/s  8.97%
      8kB   0.03ms   0.24GB/s  6.25%
     16kB   0.04ms   0.44GB/s  5.16%
     32kB   0.04ms   0.93GB/s  2.70%
     64kB   0.04ms   1.77GB/s  5.16%
    128kB   0.04ms   3.46GB/s  7.55%
    256kB   0.05ms   5.27GB/s  1.92%
    512kB   0.07ms   7.53GB/s  1.03%
   1024kB   0.11ms   9.25GB/s  2.52%
   2048kB   0.20ms  10.50GB/s  1.07%
   4096kB   0.37ms  11.41GB/s  0.58%
   8192kB   0.71ms  11.86GB/s  0.44%
  16384kB   1.38ms  12.11GB/s  0.14%
  32768kB   2.74ms  12.23GB/s  0.03%
  65536kB   5.46ms  12.29GB/s  0.08%
 131072kB  10.89ms  12.32GB/s  0.02%
 262144kB  21.75ms  12.34GB/s  0.00%
 524288kB  43.47ms  12.35GB/s  0.00%
1048576kB  86.91ms  12.35GB/s  0.00%
```
Measures the CUDA Unified Memory transfer rate using a STREAM triad kernel. A range of data set sizes is used, both smaller and larger than the device memory. Example output on a Tesla V100 PCIe 16GB:
```
buffer size       time  spread  bandwidth
      24 MB      0.1ms    3.2%  426.2GB/s
      48 MB      0.1ms   24.2%  511.6GB/s
      96 MB      0.1ms    0.8%  688.0GB/s
     192 MB      0.3ms    1.8%  700.0GB/s
     384 MB      0.5ms    0.5%  764.6GB/s
     768 MB      1.0ms    0.2%  801.8GB/s
    1536 MB      2.0ms    0.0%  816.9GB/s
    3072 MB      3.9ms    0.1%  822.9GB/s
    6144 MB      7.8ms    0.2%  823.8GB/s
   12288 MB     15.7ms    0.1%  822.1GB/s
   24576 MB   5108.3ms    0.5%    5.0GB/s
   49152 MB  10284.7ms    0.8%    5.0GB/s
```
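The collapse from ~820 GB/s to 5 GB/s at 24576 MB marks the point where the data set no longer fits into the 16 GB of device memory, so pages have to be migrated over PCIe on every traversal. A minimal sketch of the managed-memory setup (assumed form, not the repo's exact code):

```cuda
// Sketch: STREAM triad on CUDA Unified Memory; N is scanned from well below
// to well above the device memory capacity.
__global__ void triad(double *A, const double *B, const double *C, double c,
                      size_t N) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < N) A[i] = B[i] + c * C[i];
}

void runTriad(size_t N) {
  double *A, *B, *C;
  cudaMallocManaged(&A, N * sizeof(double));  // pages migrate on demand
  cudaMallocManaged(&B, N * sizeof(double));
  cudaMallocManaged(&C, N * sizeof(double));
  triad<<<(N + 255) / 256, 256>>>(A, B, C, 1.2, N);
  cudaDeviceSynchronize();
  cudaFree(A); cudaFree(B); cudaFree(C);
}
```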
Measures the latency and throughput of FMA, DIV and SQRT operations. It scans combinations of ILP=1..8, by generating 1 to 8 independent dependency chains (as sketched below), and TLP, by varying the warp count per SM from 1 to 32. The final output is an ILP/TLP table of reciprocal throughputs (cycles per operation).
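A minimal sketch of the ILP=2 variant (assumed form): each additional level of ILP adds one more independent chain.

```cuda
// Sketch: two independent DFMA dependency chains per thread (ILP = 2).
// TLP is scanned by launching 1 to 32 warps per SM.
__global__ void dfmaILP2(double *out, int iters) {
  double a = 1.0, b = 2.0;            // independent chain start values
  const double c = 0.9999, d = 1e-6;
  for (int i = 0; i < iters; i++) {
    a = a * c + d;                    // chain 1
    b = b * c + d;                    // chain 2, independent of chain 1
  }
  out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;  // keep results live
}
```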
Example output on a Tesla V100 PCIe 16GB (rows: warp counts 1, 2, 4, 8, 16, 32; columns: ILP 1 to 8):

```
DFMA
  8.67   4.63   4.57   4.66   4.63   4.72   4.79   4.97
  4.29   2.32   2.29   2.33   2.32   2.36   2.39   2.48
  2.14   1.16   1.14   1.17   1.16   1.18   1.20   1.24
  1.08   1.05   1.05   1.08   1.08   1.10   1.12   1.14
  1.03   1.04   1.04   1.08   1.07   1.10   1.11   1.14
  1.03   1.04   1.04   1.08   1.07   1.10   1.10   1.14
DDIV
111.55 111.53 111.53 111.53 111.53 668.46 779.75 891.05
 55.76  55.77  55.76  55.76  55.76 334.26 389.86 445.51
 27.88  27.88  27.88  27.88  27.88 167.12 194.96 222.82
 14.11  14.11  14.11  14.11  14.11  84.77  98.89 113.00
  8.48   8.48   8.48   8.48   8.48  50.89  59.36  67.84
  7.51   7.51   7.51   7.51   7.51  44.98  52.48  59.97
DSQRT
101.26 101.26 101.26 101.26 101.26 612.76 714.79 816.83
 50.63  50.62  50.63  50.63  50.62 306.36 357.38 408.40
 25.31  25.31  25.31  25.31  25.31 153.18 178.68 204.19
 13.56  13.56  13.56  13.56  13.56  82.75  96.83 110.29
  9.80   9.80   9.80   9.80   9.80  60.47  70.54  80.62
  9.61   9.61   9.61   9.61   9.61  58.91  68.72  78.53
```
Some features can be read from these tables.
Latencies:
- DFMA: 8 cycles
- DDIV: 112 cycles
- DSQRT: 101 cycles
Throughput of one warp (runs on one SM quadrant), no dependencies:
- DFMA: 1/4 per cycle (ILP 2; two ops overlap)
- DDIV: 1/112 per cycle (no ILP/overlap)
- DSQRT: 1/101 per cycle (no ILP/overlap)
Throughput of multiple warps (all SM quadrants), dependencies irrelevant:
- DFMA: 1 per cycle
- DDIV: 1/7.5 per cycle
- DSQRT: 1/9.6 per cycle