CUDA dot tuning
jeffhammond opened this issue · 6 comments
Not a bug, just FYI: on A100, increasing DOT_NUM_BLOCKS improves performance by a noticeable amount.
I don't see any documentation of the need to tune this. It's possible that brute-force sampling could be used to precompute GPU-specific parameters that work better.
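The brute-force sampling idea could look something like the sketch below. The build and run commands (a make invocation with a hypothetical EXTRA_FLAGS override, a ./cuda-stream binary name) are placeholders to adapt to your build setup; only the output parsing follows the BabelStream result format shown below.

```python
import re
import subprocess

def parse_dot_bandwidth(output: str) -> float:
    """Extract the Dot kernel bandwidth (MBytes/sec) from BabelStream output."""
    m = re.search(r"^Dot\s+([0-9.]+)", output, re.MULTILINE)
    if m is None:
        raise ValueError("no Dot result line found in output")
    return float(m.group(1))

def sweep_dot_num_blocks(block_counts):
    """Rebuild and run BabelStream once per candidate DOT_NUM_BLOCKS value.

    The build/run commands below are placeholders, not BabelStream's actual
    build interface -- substitute your own CMake/make invocation.
    """
    results = {}
    for n in block_counts:
        subprocess.run(["make", "clean"], check=True)                      # placeholder
        subprocess.run(["make", f"EXTRA_FLAGS=-DDOT_NUM_BLOCKS={n}"],      # placeholder
                       check=True)
        run = subprocess.run(["./cuda-stream"], capture_output=True,
                             text=True, check=True)                        # placeholder
        results[n] = parse_dot_bandwidth(run.stdout)
    best = max(results, key=results.get)
    return best, results
```

Running the sweep over, say, (256, 512, 1024, 2048) would pick out the best-performing value for the GPU at hand.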
BabelStream
Version: 4.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 8589.9 MB (=8.6 GB)
Total size: 25769.8 MB (=25.8 GB)
Using CUDA device NVIDIA A100-SXM4-80GB
Driver: 11040
Function  MBytes/sec   Min (sec)  Max      Average
Copy      1748027.292  0.00983    0.01096  0.00986
Mul       1745605.363  0.00984    0.01100  0.00987
Add       1772532.835  0.01454    0.01552  0.01456
Triad     1773862.890  0.01453    0.01547  0.01455
Dot       1555399.880  0.01105    0.01369  0.01113
With #define DOT_NUM_BLOCKS 1024 (results are similar with 2048):
BabelStream
Version: 4.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 8589.9 MB (=8.6 GB)
Total size: 25769.8 MB (=25.8 GB)
Using CUDA device NVIDIA A100-SXM4-80GB
Driver: 11040
Function  MBytes/sec   Min (sec)  Max      Average
Copy      1747717.694  0.00983    0.01097  0.00987
Mul       1745785.409  0.00984    0.01101  0.00990
Add       1772356.068  0.01454    0.01585  0.01459
Triad     1773738.597  0.01453    0.01547  0.01456
Dot       1745025.034  0.00985    0.01157  0.00992
The latter is comparable to the CUDA Fortran implementation, which uses the compiler-generated reduction; I assume that is pretty close to optimal on every GPU architecture.
I am really winning at filing duplicate issues today, aren't I? 😄
In that PR, the number of blocks was set to 4 * prop.multiProcessorCount, similar to the other models that have to guess this number (e.g. OpenCL).
Just shows us that we need to do some housekeeping ASAP... Good to bring this to the top of the pile.
For reference, the 40G A100 is showing:
BabelStream
Version: 4.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 11040
Function  MBytes/sec   Min (sec)  Max      Average
Copy      1391400.012  0.00617    0.00619  0.00618
Mul       1389127.623  0.00618    0.00620  0.00619
Add       1398448.366  0.00921    0.00948  0.00946
Triad     1398717.522  0.00921    0.00949  0.00946
Dot       1326274.626  0.00648    0.00676  0.00651
(using the default DOT_NUM_BLOCKS)