Parallel strided iteration does not scale linearly
mratsim opened this issue · 0 comments
mratsim commented
Running the following benchmark on my dual-core machine with hyper-threading enabled:
https://github.com/numforge/laser/blob/d725651c0f8ca1da3f761e27e9846bc81b9341f3/benchmarks/loop_iteration/iter_bench_prod.nim (branch foreach-dsl #4)
I get the following results without OpenMP:
Warmup: 1.1911 s, result 224 (displayed to avoid compiler optimizing warmup away)
Production implementation for tensor iteration - float64
Collected 1000 samples in 8.931 seconds
Average time: 8.930ms
Stddev time: 0.345ms
Min time: 8.723ms
Max time: 14.785ms
Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413
Production implementation for tensor iteration - float64
Collected 1000 samples in 29.597 seconds
Average time: 29.597ms
Stddev time: 1.125ms
Min time: 27.002ms
Max time: 38.606ms
Display output[[0,0]] to make sure it's not optimized away
1.143903810108473
and with OpenMP:
Warmup: 1.1874 s, result 224 (displayed to avoid compiler optimizing warmup away)
Production implementation for tensor iteration - float64
Collected 1000 samples in 4.094 seconds
Average time: 4.092ms
Stddev time: 0.206ms
Min time: 3.897ms
Max time: 5.459ms
Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413
Production implementation for tensor iteration - float64
Collected 1000 samples in 24.025 seconds
Average time: 24.022ms
Stddev time: 1.127ms
Min time: 22.379ms
Max time: 33.763ms
Display output[[0,0]] to make sure it's not optimized away
1.143903810108473
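For context, the kind of loop being measured is, in spirit, an OpenMP-parallel pass over a strided buffer. The snippet below is not the code in iter_bench_prod.nim, just a minimal stand-in using Nim's built-in `||` iterator, assuming the binary is compiled with `--passC:-fopenmp --passL:-fopenmp`:

```nim
# Simplified stand-in for the benchmarked pattern, not laser's implementation.
# `||` is Nim's built-in OpenMP iterator and lowers to `#pragma omp parallel for`.
proc addOneStrided(dst, src: ptr UncheckedArray[float64], numElems, stride: int) =
  for i in 0 || (numElems - 1):
    dst[i * stride] = src[i * stride] + 1.0
```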
For the first benchmark the speedup is about 2.2x (from 8.93 ms to 4.09 ms), but for the second it is only about 1.2x (from 29.6 ms to 24.0 ms), far from linear scaling.
Potential explanations:
- hitting the maximum memory bandwidth:
  - Parallel strided iteration also means parallel cache misses.
  - One way to alleviate that would be to use GCC/Clang `__builtin_prefetch` (see the first sketch after this list).
  - Note: this does not match the experiments reported in https://github.com/zy97140/omp-benchmark-for-pytorch. Furthermore, computing the next element's location shouldn't require any memory access.
    Unfortunately, even if it requires no memory access, the CPU cannot use instruction-level parallelism to overlap the main computation with the next-element computation, because that bookkeeping is full of branches.
- accesses to shape and strides during strided iteration are slow:
  - This might happen if the compiler doesn't create a local copy or keep them in registers, because it cannot prove that nothing in the loop body mutates them (see the second sketch after this list).
- strided iteration is running out of registers and/or instruction cache space:
  - Unlikely, as those are per-core rather than per-socket resources, so we should still see a linear speedup.
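For illustration, here is a minimal sketch of the prefetch idea in Nim. The `prefetch` wrapper, the `lookahead` constant, and the 1-D loop are mine, not laser's API, and it assumes a GCC/Clang backend since `__builtin_prefetch` is a compiler builtin:

```nim
# Hypothetical binding to the GCC/Clang builtin; the last two arguments must
# be compile-time constants on the C side, so only pass literals.
proc prefetch(address: pointer, rw, locality: cint)
  {.importc: "__builtin_prefetch", nodecl.}

proc sumEveryNth(buf: ptr UncheckedArray[float64], numElems, stride: int): float64 =
  const lookahead = 8   # how many iterations ahead to prefetch; needs tuning
  for i in 0 ..< numElems:
    if i + lookahead < numElems:
      # Read prefetch (rw = 0) with high temporal locality (3) for the element
      # we will touch `lookahead` iterations from now, so the cache miss
      # overlaps with the current work instead of stalling it.
      prefetch(buf[(i + lookahead) * stride].addr, 0, 3)
    result += buf[i * stride]
```

Whether this helps depends on the bottleneck: prefetching hides latency but cannot create memory bandwidth that isn't there.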
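And a second sketch covering the shape/strides concern together with the branchy next-element bookkeeping. The `StridedView` type and its field names are hypothetical, purely to show where a local copy of the metadata would go:

```nim
const MaxRank = 8

type
  StridedView = object            # hypothetical layout, not laser's tensor type
    rank: int
    shape, strides: array[MaxRank, int]
    data: ptr UncheckedArray[float64]

proc sumStrided(t: StridedView): float64 =
  # Hoist the metadata into locals once. Without this, the C compiler may
  # reload t.shape / t.strides from memory on every iteration because it
  # cannot prove that the loop body does not mutate them through an alias.
  let rank = t.rank
  let shape = t.shape             # value copies, can live in registers
  let strides = t.strides

  var coord: array[MaxRank, int]  # current multi-dimensional index
  var offset = 0                  # flat offset of the current element
  var remaining = 1
  for d in 0 ..< rank:
    remaining *= shape[d]

  while remaining > 0:
    result += t.data[offset]
    # Advance to the next element: increment the innermost coordinate and
    # carry into outer dimensions. This is the branchy bookkeeping mentioned
    # above: the data-dependent carry chain limits instruction-level
    # parallelism even though it touches no memory beyond the locals.
    var d = rank - 1
    while d >= 0:
      inc coord[d]
      offset += strides[d]
      if coord[d] < shape[d]:
        break
      offset -= coord[d] * strides[d]
      coord[d] = 0
      dec d
    dec remaining
```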