Binary GCD vs. c++ std::gcd benchmark.
sofia-snow opened this issue · 3 comments
Hello,
Not sure how you benchmarked the GCD code, but on my laptop (Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz), std::gcd is significantly faster than the optimized binary GCD implementation. (Edit: I accidentally left my benchmark on a pathological test case instead... whoops.)
https://godbolt.org/z/3475szdz8
```
std::gcd    6.6 ms / 1048576 iterations =  6.294 nanoseconds
binary_gcd   23 ms / 1048576 iterations = 21.934 nanoseconds
```
```
Benchmark 1: ./std_gcd
  Time (mean ± σ):     6.6 ms ± 0.7 ms    [User: 6.2 ms, System: 0.6 ms]
  Range (min … max):   6.1 ms … 9.3 ms    252 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./binary_gcd
  Time (mean ± σ):     23.0 ms ± 1.7 ms    [User: 22.6 ms, System: 0.6 ms]
  Range (min … max):   22.3 ms … 30.4 ms    119 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './std_gcd' ran
    3.35 ± 0.32 times faster than './binary_gcd'
```
```
 Performance counter stats for './std_gcd':

          11.17 msec task-clock:u        #    0.969 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
            100      page-faults:u       #    8.949 K/sec
     30,609,505      cycles:u            #    2.739 GHz
     31,468,434      instructions:u      #    1.03  insn per cycle
      5,564,990      branches:u          #  498.021 M/sec
          8,829      branch-misses:u     #    0.16% of all branches

    0.011527144 seconds time elapsed

    0.011597000 seconds user
    0.000000000 seconds sys

 Performance counter stats for './binary_gcd':

          31.39 msec task-clock:u        #    0.986 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
             98      page-faults:u       #    3.122 K/sec
     99,257,572      cycles:u            #    3.162 GHz
    235,231,331      instructions:u      #    2.37  insn per cycle
     18,247,030      branches:u          #  581.359 M/sec
        628,198      branch-misses:u     #    3.44% of all branches

    0.031830958 seconds time elapsed

    0.031932000 seconds user
    0.000000000 seconds sys
```
Eek. My benchmark test was flawed: gcd(i, i+1) was a lazy and pathological test case (consecutive integers are always coprime, so the Euclidean algorithm finishes almost immediately, while binary GCD still grinds through a full run of shift-and-subtract steps). I replaced it with a proper gcd(i, j) over all i, j in [0, 1024), and as expected the binary GCD is indeed faster — ~2.77x faster for me.
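For concreteness, the corrected harness I'm describing is roughly the following (function and variable names are mine, not the actual benchmark code; the checksum keeps the compiler from deleting the loop):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>  // std::gcd

// Sum the gcds of every pair (i, j) in [0, n) x [0, n).
uint64_t gcd_grid_checksum(int n) {
    uint64_t sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += std::gcd(i, j);
    return sum;
}

// Time the grid and report nanoseconds per gcd call.
double ns_per_call(int n) {
    auto start = std::chrono::steady_clock::now();
    uint64_t checksum = gcd_grid_checksum(n);
    auto stop = std::chrono::steady_clock::now();
    // Printing the checksum makes the measured work observable,
    // so the compiler can't optimize it away.
    std::printf("checksum: %llu\n", (unsigned long long)checksum);
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / (double(n) * n);
}
```

With `n = 1024` this makes the 1024 × 1024 = 1048576 calls used in the numbers below.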
```
Benchmark 1: ./std_gcd
  Time (mean ± σ):     38.7 ms ± 2.4 ms    [User: 38.1 ms, System: 0.8 ms]
  Range (min … max):   37.5 ms … 50.1 ms    74 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./binary_gcd
  Time (mean ± σ):     13.9 ms ± 1.1 ms    [User: 13.6 ms, System: 0.6 ms]
  Range (min … max):   13.3 ms … 18.6 ms    189 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './binary_gcd' ran
    2.77 ± 0.28 times faster than './std_gcd'

std::gcd    38.7 ms / 1048576 iterations = 36.90719 nanoseconds
binary_gcd  14.4 ms / 1048576 iterations = 13.73291 nanoseconds
```
```
 Performance counter stats for './std_gcd':

          51.30 msec task-clock:u        #    0.985 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
             99      page-faults:u       #    1.930 K/sec
    162,904,210      cycles:u            #    3.176 GHz
     47,630,250      instructions:u      #    0.29  insn per cycle
      9,656,563      branches:u          #  188.246 M/sec
        863,704      branch-misses:u     #    8.94% of all branches

    0.052098612 seconds time elapsed

    0.048194000 seconds user
    0.004016000 seconds sys

 Performance counter stats for './binary_gcd':

          18.85 msec task-clock:u        #    0.979 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
             98      page-faults:u       #    5.198 K/sec
     59,685,387      cycles:u            #    3.166 GHz
    106,500,192      instructions:u      #    1.78  insn per cycle
     11,381,328      branches:u          #  603.637 M/sec
        850,311      branch-misses:u     #    7.47% of all branches

    0.019259380 seconds time elapsed

    0.019376000 seconds user
    0.000000000 seconds sys
```
If only I had read the book in order! The benchmarking section answers how you did the benchmark, and points out that even the benchmark I intended was still flawed, since iterating over the inputs in a fixed order makes life easier on the branch predictor. ;-)
https://en.algorithmica.org/hpc/profiling/benchmarking/
https://en.algorithmica.org/hpc/profiling/noise/
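Following the advice in those chapters, one way to keep the branch predictor from learning the input pattern is to pre-generate random pairs and only time the gcd loop itself. This is my own sketch of the idea, not the book's code:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>  // std::gcd
#include <random>
#include <utility>
#include <vector>

// Pre-generate random input pairs so the branch history inside gcd
// differs from call to call, instead of repeating a fixed (i, j)
// pattern that the predictor can learn.
uint64_t randomized_gcd_sum(std::size_t calls, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> dist(0, 1023);
    std::vector<std::pair<std::uint32_t, std::uint32_t>> inputs(calls);
    for (auto& [a, b] : inputs) {
        a = dist(rng);
        b = dist(rng);
    }
    // Only this loop belongs inside the timed region; input generation
    // stays outside so it isn't measured.
    uint64_t checksum = 0;
    for (auto [a, b] : inputs)
        checksum += std::gcd(a, b);
    return checksum;
}
```

Fixing the seed keeps runs reproducible while still defeating the predictor within a run.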
Sidenote. I found the book because https://algorithmica.org/en/eytzinger was mentioned in Futhark's binary search example, so started with the binary search section. I've linearly just finished section 3. Out of order read several more sections.
I'm really enjoying this book and have shared with friends. Thank you. <3
One minor nitpick: I find it hard to consider ARM a RISC-based ISA. If you're going to contrast CISC and RISC, ARM belongs under CISC and RISC-V belongs under RISC. I'd rather see RISC-V assembly instead of x86, though I understand why many x86-specific idioms are mentioned. RISC-V has a set-less-than instruction, so you don't need a SUB, SLL, SUB sequence to construct a mask; you can just do SLT, ADDI -1 (which could be macro-op fused to run in a single cycle). Similarly, RISC-V has no exceptions or condition flags to taint, and it has single-instruction conditional branches.
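To illustrate the mask idiom in question (my own example, in C++ rather than assembly): the portable source pattern is to negate a 0/1 comparison result, which a RISC-V compiler can lower to SLT plus one more instruction, while flag-based ISAs typically go through the flags register or a conditional move:

```cpp
#include <cstdint>

// Branchless min via a compare-derived mask: (a < b) is 0 or 1, and
// negating it in unsigned arithmetic yields an all-zeros or all-ones mask.
uint64_t branchless_min(uint64_t a, uint64_t b) {
    uint64_t mask = -uint64_t(a < b);  // 0xFFFF...F if a < b, else 0
    return (a & mask) | (b & ~mask);   // select a or b without branching
}
```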
Two sections I feel are missing: macro-op fusion and out-of-order execution. RISC-V is a particularly good example of an ISA where macro-op fusion works very well. Out-of-order execution typically also makes register copies and renames "free", so they don't count toward execution-unit usage or cause data hazards.
I feel it might be worth noting the cycles, instructions, and branches counters. The binary_gcd code runs double the instructions and almost 2 million additional branches, yet runs 2.77x faster, taking 100 million fewer cycles. This is the best example of "IPC matters" I have ever seen. :-)