te42kyfo/gpu-benches

Why blocksize is 256 in gpu-cache test


Hey, I noticed that in the gpu-cache test the block size is 256. Why is it not 1024?

When I changed the block size from 256 to 1024, the measured L1 cache bandwidth improved somewhat, but it also fluctuated more.

blocksize = 256 results as follows

     data set   exec time     spread        Eff. bw       DRAM read      DRAM write         L2 read       L2 store
         1 kB        50ms       0.7%    8648.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         2 kB        37ms       0.1%   11608.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         3 kB        33ms       0.0%   12947.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         4 kB        31ms       5.4%   14061.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         6 kB        30ms       3.3%   14402.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         8 kB        30ms       6.6%   14989.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        10 kB        30ms       3.0%   14555.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        12 kB        30ms      27.9%   15976.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        14 kB        30ms       5.3%   14430.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        16 kB        30ms       2.2%   14588.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        18 kB        33ms       2.0%   13113.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        20 kB        30ms      17.5%   15206.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        22 kB        29ms       7.9%   15610.4 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        24 kB        28ms      11.8%   15916.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        26 kB        32ms      11.1%   13737.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        28 kB        30ms       5.0%   14240.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        30 kB        31ms       0.6%   14172.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        32 kB        30ms       4.1%   14733.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        34 kB        29ms       2.2%   14845.4 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        36 kB        29ms       3.3%   15113.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        38 kB        29ms       5.4%   14967.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        40 kB        29ms       5.4%   15129.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        42 kB        29ms       8.7%   15437.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        44 kB        29ms       7.0%   15451.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        46 kB        29ms       8.4%   15633.8 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        48 kB        28ms      12.3%   15940.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        50 kB        28ms      16.4%   16288.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        52 kB        28ms      14.6%   16230.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        54 kB        28ms      12.6%   16195.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        56 kB        27ms      10.0%   16434.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        58 kB        28ms      11.0%   16433.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s

blocksize = 1024 results as follows

     data set   exec time     spread        Eff. bw       DRAM read      DRAM write         L2 read       L2 store
         4 kB        37ms       0.1%   11645.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         6 kB       111ms       0.0%    3902.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         8 kB        29ms      46.0%   17593.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        10 kB        66ms       6.0%    6564.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        12 kB        29ms      24.8%   16609.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        14 kB        52ms       1.4%    8303.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        16 kB        28ms      27.1%   17275.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        18 kB        44ms       6.6%    9894.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        20 kB        28ms      27.0%   17521.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        22 kB        39ms       7.5%   11307.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        24 kB        27ms      16.9%   17184.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        26 kB        37ms      18.0%   12475.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        28 kB        27ms      40.3%   18542.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        30 kB        34ms      11.9%   13365.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        32 kB        26ms      20.7%   18043.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        34 kB        34ms      23.1%   14124.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        36 kB        27ms      26.9%   17707.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s

My device is A800 80GB PCIe.

The number of threads per block needs to be a divisor of N, the template parameter that sets the measured data set size. Otherwise, many threads end up doing too much work.

With a block size of 1024, lines 144 and onward could only use multiples of 1024 as the template parameter. On GPUs that do not have an L1 cache as large, the amount of work per thread would then be very small, and performance actually worse.