This microbenchmark illustrates that, under certain memory access patterns, "false sharing"-style performance overheads occur even when threads access nearby but distinct cache lines.
The benchmark is based on the hypothesis that hardware prefetching of cache lines, triggered by a thread's access pattern, can cause contention between threads.
As such, the microbenchmark performs concurrent strided read/write accesses to separate 512-byte blocks of memory, with K bytes of padding between consecutive blocks for K = 64, 128, 256, 512. The idea is that the access pattern will cause cache lines beyond the end of a 512-byte block to be prefetched, and that these prefetched cache lines will overlap with memory concurrently being read and written by other threads.
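The layout and access pattern described above can be sketched roughly as follows. This is not the benchmark's actual source. It uses std::thread instead of Cilk, and it assumes a 64-byte cache line, a 64-byte access stride, and a distance of 512 + K bytes between the starts of consecutive blocks. The names block_offset, touch_block and run_tasks are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kBlockBytes = 512;  // block size stated in the text
constexpr std::size_t kCacheLine = 64;    // assumed cache line size and stride

// Byte offset of the block owned by `task` when consecutive blocks are
// separated by `padding` bytes (assumed layout: block, padding, block, ...).
constexpr std::size_t block_offset(std::size_t task, std::size_t padding) {
    return task * (kBlockBytes + padding);
}

// Strided read/write over one 512-byte block: touch one byte per assumed
// cache line, `iters` times. Returns a checksum so the work is observable.
std::uint64_t touch_block(volatile std::uint8_t* block, std::size_t iters) {
    std::uint64_t sum = 0;
    for (std::size_t it = 0; it < iters; ++it) {
        for (std::size_t off = 0; off < kBlockBytes; off += kCacheLine) {
            block[off] = static_cast<std::uint8_t>(block[off] + 1);
            sum += block[off];
        }
    }
    return sum;
}

// Run `tasks` threads, each touching only its own block. No byte of a
// block is shared; any slowdown would come from neighbouring cache lines.
std::uint64_t run_tasks(std::size_t tasks, std::size_t padding,
                        std::size_t iters) {
    assert(tasks > 0);
    // Over-allocate by one line so the base can be cache-line aligned.
    std::vector<std::uint8_t> storage(
        tasks * (kBlockBytes + padding) + kCacheLine, 0);
    auto addr = reinterpret_cast<std::uintptr_t>(storage.data());
    std::uint8_t* base =
        storage.data() + (kCacheLine - addr % kCacheLine) % kCacheLine;

    std::vector<std::uint64_t> sums(tasks, 0);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < tasks; ++t) {
        threads.emplace_back([=, &sums] {
            sums[t] = touch_block(base + block_offset(t, padding), iters);
        });
    }
    for (auto& th : threads) th.join();

    std::uint64_t total = 0;
    for (std::uint64_t s : sums) total += s;
    return total;
}
```

Timing run_tasks(8, K, iters) for each K with a large iteration count would be one way to look for the effect. Whether it appears depends on the CPU's prefetchers and on how the threads are placed on cores.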
Example output on an AWS instance with one socket and 8 physical cores (Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz):
$ CILK_NWORKERS=8 ./test_memory_perf
Running on 8 logically parallel tasks using 8 actual worker threads
TEST avg for alignment 64: 0.245000
TEST avg for alignment 128: 0.203000
TEST avg for alignment 256: 0.176000
TEST avg for alignment 512: 0.176000
TEST avg for alignment 64: 0.238000
TEST avg for alignment 128: 0.197000
TEST avg for alignment 256: 0.176000
Running on 1 worker with the same number of logical tasks does not show the same overhead. The reported times are nearly identical for every value of K:
$ CILK_NWORKERS=1 ./test_memory_perf
Running on 8 logically parallel tasks using 1 actual worker threads
TEST avg for alignment 64: 0.835000
TEST avg for alignment 128: 0.836000
TEST avg for alignment 256: 0.839000
TEST avg for alignment 512: 0.839000
TEST avg for alignment 64: 0.835000
TEST avg for alignment 128: 0.835000
TEST avg for alignment 256: 0.839000
TEST avg for alignment 512: 0.839000