Why blocksize is 256 in gpu-cache test
blueWatermelonFri commented
Hey, I noticed that the gpu-cache test uses a blocksize of 256. Why is it not 1024?
When I changed the blocksize from 256 to 1024, the measured L1 cache bandwidth improved somewhat but fluctuated more.
Results with blocksize = 256:
data set exec time spread Eff. bw DRAM read DRAM write L2 read L2 store
1 kB 50ms 0.7% 8648.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
2 kB 37ms 0.1% 11608.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
3 kB 33ms 0.0% 12947.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
4 kB 31ms 5.4% 14061.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
6 kB 30ms 3.3% 14402.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
8 kB 30ms 6.6% 14989.1 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
10 kB 30ms 3.0% 14555.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
12 kB 30ms 27.9% 15976.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
14 kB 30ms 5.3% 14430.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
16 kB 30ms 2.2% 14588.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
18 kB 33ms 2.0% 13113.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
20 kB 30ms 17.5% 15206.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
22 kB 29ms 7.9% 15610.4 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
24 kB 28ms 11.8% 15916.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
26 kB 32ms 11.1% 13737.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
28 kB 30ms 5.0% 14240.1 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
30 kB 31ms 0.6% 14172.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
32 kB 30ms 4.1% 14733.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
34 kB 29ms 2.2% 14845.4 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
36 kB 29ms 3.3% 15113.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
38 kB 29ms 5.4% 14967.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
40 kB 29ms 5.4% 15129.5 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
42 kB 29ms 8.7% 15437.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
44 kB 29ms 7.0% 15451.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
46 kB 29ms 8.4% 15633.8 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
48 kB 28ms 12.3% 15940.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
50 kB 28ms 16.4% 16288.1 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
52 kB 28ms 14.6% 16230.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
54 kB 28ms 12.6% 16195.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
56 kB 27ms 10.0% 16434.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
58 kB 28ms 11.0% 16433.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
Results with blocksize = 1024:
data set exec time spread Eff. bw DRAM read DRAM write L2 read L2 store
4 kB 37ms 0.1% 11645.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
6 kB 111ms 0.0% 3902.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
8 kB 29ms 46.0% 17593.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
10 kB 66ms 6.0% 6564.7 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
12 kB 29ms 24.8% 16609.0 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
14 kB 52ms 1.4% 8303.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
16 kB 28ms 27.1% 17275.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
18 kB 44ms 6.6% 9894.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
20 kB 28ms 27.0% 17521.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
22 kB 39ms 7.5% 11307.5 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
24 kB 27ms 16.9% 17184.6 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
26 kB 37ms 18.0% 12475.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
28 kB 27ms 40.3% 18542.5 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
30 kB 34ms 11.9% 13365.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
32 kB 26ms 20.7% 18043.9 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
34 kB 34ms 23.1% 14124.3 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
36 kB 27ms 26.9% 17707.2 GB/s 0 GB/s 0 GB/s 0 GB/s 0 GB/s
My device is an A800 80GB PCIe.
te42kyfo commented
The number of thread blocks needs to be a divisor of N, which is a template parameter to measure. Otherwise many threads will do too much work.
From line 144 onward, use only multiples of 1024 as the template parameter. On some GPUs, which do not have as large an L1 cache, the amount of work per thread would then be very small and the performance actually worse.