te42kyfo/gpu-benches

The variables in the cache-latency test


I am confused about the roles of cl_size, cl_lane, and skip_factor. Could you explain the purpose behind the design of these three variables? How do they affect the access pattern for cache or global memory?

The idea behind cl_size is to avoid jumping to a different value within an already accessed cache line. In a purely random pointer chain, there would be a high chance of jumping into an already accessed cache line even when the data set exceeds a cache level.

After every cache line has been accessed once, the chain wraps around and accesses the next lane within each cache line. The skip factor just skips values entirely, mostly to speed things up.
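To make that construction concrete, here is a minimal host-side sketch of how such a chain could be built, assuming each cache line contributes cl_size contiguous lanes followed by skipped padding. The function name buildChain and the layout details are illustrative assumptions, not the benchmark's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Builds an index chain over `numLines` cache lines. Each line contributes
// `cl_size` lanes that are actually chased; `skip_factor` stretches the
// spacing so the skipped slots are never read. Every line is visited once
// (lane 0, in shuffled order) before the chain wraps around to lane 1, so a
// line is never re-touched until all other lines have been touched.
std::vector<uint64_t> buildChain(size_t numLines, size_t cl_size, size_t skip_factor) {
    std::vector<size_t> lineOrder(numLines);
    for (size_t i = 0; i < numLines; ++i) lineOrder[i] = i;
    std::shuffle(lineOrder.begin(), lineOrder.end(), std::mt19937_64(42));

    const size_t stride = cl_size * skip_factor;   // slots per line, incl. skipped ones
    std::vector<uint64_t> buf(numLines * stride, 0);

    size_t prev = lineOrder[0] * stride;           // lane 0 of the first line
    for (size_t lane = 0; lane < cl_size; ++lane) {
        for (size_t i = (lane == 0 ? 1 : 0); i < numLines; ++i) {
            const size_t next = lineOrder[i] * stride + lane;
            buf[prev] = next;                      // each entry holds the index of the next one
            prev = next;
        }
    }
    buf[prev] = lineOrder[0] * stride;             // close the cycle
    return buf;
}

// The latency kernel would then simply chase the chain: idx = buf[idx];
```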

Thank you for your answers! I still have a few questions:

  1. Why is the default cl_size set to 1? Does this mean the L2 cache line is 64 KB?
  2. Could you clarify which bits of the address determine the cache set mapping? It seems this might influence the selection of the skip_factor.

skip_factor and cl_size do very similar things, except that skipped data is not read at all. I am using a cl_size of 1 and a larger skip_factor because long, memory-sized data sets would take forever otherwise. Since GPUs have either a 32 B or 64 B transfer granularity, the product of skip_factor and cl_size should be at least 32 B or 64 B divided by the 8 B pointer size, i.e. 4 or 8, to avoid jumping into the same cache line again before all of the other cache lines in the data set have been accessed.
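That arithmetic could be checked at compile time with something like the sketch below; the constant names and the 64 B granularity are assumptions for illustration, not values taken from the benchmark.

```cpp
#include <cstddef>

// Assumed constants: 8 B chain elements and a 64 B transfer granularity
// (some GPUs use 32 B instead).
constexpr size_t kPointerSize         = 8;
constexpr size_t kTransferGranularity = 64;

// The spacing between chased elements of different cache lines is
// skip_factor * cl_size * 8 B; it must cover a full transfer so the chain
// never lands in a cache line that was already fetched in the same pass.
constexpr bool strideCoversTransfer(size_t cl_size, size_t skip_factor) {
    return cl_size * skip_factor * kPointerSize >= kTransferGranularity;
}

static_assert(strideCoversTransfer(1, 8), "cl_size=1, skip_factor=8 -> 64 B stride");
static_assert(strideCoversTransfer(8, 1), "cl_size=8, skip_factor=1 -> 64 B stride");
static_assert(!strideCoversTransfer(1, 4), "1 * 4 * 8 B = 32 B, too small for 64 B granularity");
```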

The cache set mapping is not linear but swizzled on many GPUs; I haven't reverse engineered the mapping. Yes, it would be nice to hit every cache set exactly the same number of times, to create the sharpest possible cache level transitions, but random accesses come pretty close, so being aware of cache sets is not necessary.
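For reference, this is what a linear (non-swizzled) set mapping would look like; the cache parameters below are made-up placeholders, since the real swizzled mapping is not known here.

```cpp
#include <cstdint>

// Hypothetical parameters for illustration only; real GPU L2 configurations
// differ and their set mapping is swizzled rather than linear.
constexpr uint64_t kLineBytes = 64;
constexpr uint64_t kNumSets   = 512;

// Linear mapping: the set index is taken directly from the address bits just
// above the line-offset bits (bits 6..14 for 64 B lines and 512 sets).
inline uint64_t linearSetIndex(uint64_t addr) {
    return (addr / kLineBytes) % kNumSets;
}
```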

Thank you for your patient answer.