UoB-HPC/BabelStream

CUDA dot tuning

jeffhammond opened this issue · 6 comments

Not a bug, just FYI, but on the A100, increasing DOT_NUM_BLOCKS improves performance by a noticeable amount.

I don't see any documentation of the need to tune this. It's possible that brute-force sampling could be used to precompute GPU-specific parameters that work better.
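For context, the dot kernel's grid size is fixed by this define, so DOT_NUM_BLOCKS directly controls how much parallelism the first reduction phase exposes. Below is a minimal sketch of that pattern as I understand it (constants and the exact kernel body may differ from BabelStream's source):

```cpp
#define TBSIZE 1024
#define DOT_NUM_BLOCKS 256  // the tunable in question

// Launched as: dot_kernel<<<DOT_NUM_BLOCKS, TBSIZE>>>(d_a, d_b, d_sums, n);
template <typename T>
__global__ void dot_kernel(const T *a, const T *b, T *block_sums, int n)
{
  __shared__ T tb_sum[TBSIZE];

  // Grid-stride loop: each of the DOT_NUM_BLOCKS * TBSIZE threads
  // accumulates a strided slice of the dot product.
  T sum = 0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    sum += a[i] * b[i];
  tb_sum[threadIdx.x] = sum;

  // Shared-memory tree reduction within each block.
  for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
  {
    __syncthreads();
    if (threadIdx.x < offset)
      tb_sum[threadIdx.x] += tb_sum[threadIdx.x + offset];
  }

  // One partial sum per block; the host adds the DOT_NUM_BLOCKS values.
  if (threadIdx.x == 0)
    block_sums[blockIdx.x] = tb_sum[0];
}
```

Because the grid size is fixed at compile time regardless of the device, a value that suits one GPU can under-fill a larger one, which is presumably why this benefits from per-GPU tuning.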

BabelStream
Version: 4.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 8589.9 MB (=8.6 GB)
Total size: 25769.8 MB (=25.8 GB)
Using CUDA device NVIDIA A100-SXM4-80GB
Driver: 11040
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1748027.292 0.00983     0.01096     0.00986
Mul         1745605.363 0.00984     0.01100     0.00987
Add         1772532.835 0.01454     0.01552     0.01456
Triad       1773862.890 0.01453     0.01547     0.01455
Dot         1555399.880 0.01105     0.01369     0.01113

With #define DOT_NUM_BLOCKS 1024 (results are similar with 2048):

BabelStream
Version: 4.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 8589.9 MB (=8.6 GB)
Total size: 25769.8 MB (=25.8 GB)
Using CUDA device NVIDIA A100-SXM4-80GB
Driver: 11040
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1747717.694 0.00983     0.01097     0.00987
Mul         1745785.409 0.00984     0.01101     0.00990
Add         1772356.068 0.01454     0.01585     0.01459
Triad       1773738.597 0.01453     0.01547     0.01456
Dot         1745025.034 0.00985     0.01157     0.00992

The latter is comparable to the CUDA Fortran implementation, which uses the compiler-generated reduction; I assume that is pretty close to optimal for every GPU architecture.
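For comparison, a library reduction sidesteps the hand-picked block count entirely by choosing its own launch configuration per device. A hedged sketch using Thrust (not what BabelStream uses; just an illustration of the "let the compiler/library pick the grid" approach):

```cpp
#include <thrust/inner_product.h>
#include <thrust/execution_policy.h>

// d_a and d_b are device pointers to n doubles (assumed already allocated
// and filled). Thrust selects the reduction's grid and block sizes
// internally, much as the CUDA Fortran compiler does for its generated
// reduction.
double dot(const double *d_a, const double *d_b, int n)
{
  return thrust::inner_product(thrust::device, d_a, d_a + n, d_b, 0.0);
}
```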

Related to #122

I am really winning at filing duplicate issues today, aren't I? 😄

In that PR, the number of blocks was set to 4 * prop.multiProcessorCount, in a similar way to the other models that need to guess this number (e.g. OpenCL).
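Along those lines, the block count can be derived at runtime from the device properties rather than a compile-time define. A minimal sketch, assuming the 4x multiplier from that PR:

```cpp
#include <cuda_runtime.h>

// Size the dot-product grid to the device's SM count, as in the
// referenced PR, instead of using a hard-coded DOT_NUM_BLOCKS.
int dot_num_blocks(int device)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  return 4 * prop.multiProcessorCount;  // e.g. 4 * 108 = 432 on an A100
}
```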

> I am really winning at filing duplicate issues today, aren't I? 😄

Just shows us that we need to do some housekeeping ASAP... Good to bring this to the top of the pile.

For reference, the 40G A100 is showing:

BabelStream
Version: 4.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 11040
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        1391400.012 0.00617     0.00619     0.00618     
Mul         1389127.623 0.00618     0.00620     0.00619     
Add         1398448.366 0.00921     0.00948     0.00946     
Triad       1398717.522 0.00921     0.00949     0.00946     
Dot         1326274.626 0.00648     0.00676     0.00651     

(using the default block size)

Fixed in 092ee67; this will be in the next release.