algorithmica-org/algorithmica

Binary GCD vs. C++ std::gcd benchmark.

sofia-snow opened this issue · 3 comments

Hello,

Not sure how you benchmarked the GCD code, but on my laptop (Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz), std::gcd is significantly faster than the optimized binary GCD implementation. (Edit: I accidentally left my benchmark on a pathological test case instead... whoops.)

https://godbolt.org/z/3475szdz8
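For reference, the binary GCD I'm benchmarking is roughly the following (my paraphrase of the book's ctz-based version; the godbolt link above has the exact code I ran):

```cpp
#include <algorithm>

// Binary GCD sketch: strip common factors of two with __builtin_ctz
// (a GCC/Clang builtin) instead of shifting one bit at a time.
int binary_gcd(int a, int b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int az = __builtin_ctz(a);          // trailing zeros of a
    int bz = __builtin_ctz(b);          // trailing zeros of b
    int shift = std::min(az, bz);       // common power of two
    a >>= az;                           // a is now odd
    while (b != 0) {
        b >>= __builtin_ctz(b);         // make b odd
        if (a > b) std::swap(a, b);     // keep a <= b
        b -= a;                         // difference is even, a stays odd
    }
    return a << shift;                  // restore the common power of two
}
```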

std::gcd     6.6 ms / 1048576 iterations =  6.294 nanoseconds
binary_gcd  23   ms / 1048576 iterations = 21.934 nanoseconds
Benchmark 1: ./std_gcd
  Time (mean ± σ):       6.6 ms ±   0.7 ms    [User: 6.2 ms, System: 0.6 ms]
  Range (min … max):     6.1 ms …   9.3 ms    252 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ./binary_gcd
  Time (mean ± σ):      23.0 ms ±   1.7 ms    [User: 22.6 ms, System: 0.6 ms]
  Range (min … max):    22.3 ms …  30.4 ms    119 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  './std_gcd' ran
    3.35 ± 0.32 times faster than './binary_gcd'

 Performance counter stats for './std_gcd':

             11.17 msec task-clock:u              #    0.969 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               100      page-faults:u             #    8.949 K/sec
        30,609,505      cycles:u                  #    2.739 GHz
        31,468,434      instructions:u            #    1.03  insn per cycle
         5,564,990      branches:u                #  498.021 M/sec
             8,829      branch-misses:u           #    0.16% of all branches

       0.011527144 seconds time elapsed

       0.011597000 seconds user
       0.000000000 seconds sys

 Performance counter stats for './binary_gcd':

             31.39 msec task-clock:u              #    0.986 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                98      page-faults:u             #    3.122 K/sec
        99,257,572      cycles:u                  #    3.162 GHz
       235,231,331      instructions:u            #    2.37  insn per cycle
        18,247,030      branches:u                #  581.359 M/sec
           628,198      branch-misses:u           #    3.44% of all branches

       0.031830958 seconds time elapsed

       0.031932000 seconds user
       0.000000000 seconds sys

Eek. My benchmark was flawed: gcd(i, i+1) was a lazy, pathological test case, since consecutive integers are always coprime. I replaced it with a proper gcd(i, j) over all i, j in [0, 1024); as expected, binary GCD is indeed faster: ~2.77x for me.
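Concretely, the fixed benchmark kernel is roughly this (my harness; the timing/driver code around it is omitted):

```cpp
#include <numeric>  // std::gcd

// All pairs (i, j) in [0, n)^2; with n = 1024 that's 1024 * 1024 = 1048576
// gcd calls per run. Accumulating into a checksum keeps the compiler from
// optimizing the calls away.
unsigned long long gcd_all_pairs(int n) {
    unsigned long long checksum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            checksum += std::gcd(i, j);
    return checksum;
}
```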

Benchmark 1: ./std_gcd
  Time (mean ± σ):      38.7 ms ±   2.4 ms    [User: 38.1 ms, System: 0.8 ms]
  Range (min … max):    37.5 ms …  50.1 ms    74 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ./binary_gcd
  Time (mean ± σ):      13.9 ms ±   1.1 ms    [User: 13.6 ms, System: 0.6 ms]
  Range (min … max):    13.3 ms …  18.6 ms    189 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  './binary_gcd' ran
    2.77 ± 0.28 times faster than './std_gcd'
std::gcd    38.7 ms / 1048576 iterations = 36.90719 nanoseconds
binary_gcd  14.4 ms / 1048576 iterations = 13.73291 nanoseconds

 Performance counter stats for './std_gcd':

             51.30 msec task-clock:u              #    0.985 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                99      page-faults:u             #    1.930 K/sec
       162,904,210      cycles:u                  #    3.176 GHz
        47,630,250      instructions:u            #    0.29  insn per cycle
         9,656,563      branches:u                #  188.246 M/sec
           863,704      branch-misses:u           #    8.94% of all branches

       0.052098612 seconds time elapsed

       0.048194000 seconds user
       0.004016000 seconds sys

 Performance counter stats for './binary_gcd':

             18.85 msec task-clock:u              #    0.979 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                98      page-faults:u             #    5.198 K/sec
        59,685,387      cycles:u                  #    3.166 GHz
       106,500,192      instructions:u            #    1.78  insn per cycle
        11,381,328      branches:u                #  603.637 M/sec
           850,311      branch-misses:u           #    7.47% of all branches

       0.019259380 seconds time elapsed

       0.019376000 seconds user
       0.000000000 seconds sys

If only I had read the book in order. The benchmarking section answers how you did the benchmark, and points out that even the benchmark I intended was still flawed, as it goes easy on the branch predictor. ;-)

https://en.algorithmica.org/hpc/profiling/benchmarking/
https://en.algorithmica.org/hpc/profiling/noise/
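If I understand the benchmarking chapter right, the fix is to pre-generate shuffled random operand pairs outside the timed region, so the predictor can't learn a regular pattern like my nested i/j sweep. A sketch of what I mean (my paraphrase, not the book's exact harness):

```cpp
#include <numeric>   // std::gcd
#include <random>
#include <utility>
#include <vector>

// Pre-generate random operand pairs outside the timed region, so the
// branch predictor can't learn a regular input pattern like (i, i+1)
// or a nested i/j sweep.
std::vector<std::pair<int, int>> make_inputs(int n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(1, 1 << 20);
    std::vector<std::pair<int, int>> inputs(n);
    for (auto &[a, b] : inputs) {
        a = dist(rng);
        b = dist(rng);
    }
    return inputs;
}

// Only this loop would get timed:
unsigned long long checksum(const std::vector<std::pair<int, int>> &inputs) {
    unsigned long long sum = 0;
    for (auto [a, b] : inputs)
        sum += std::gcd(a, b);
    return sum;
}
```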


Sidenote: I found the book because https://algorithmica.org/en/eytzinger was mentioned in Futhark's binary search example, so I started with the binary search section. Reading linearly, I've just finished section 3, and I've read several more sections out of order.

I'm really enjoying this book and have shared it with friends. Thank you. <3

One minor nit-pick: I find it hard to consider ARM a RISC ISA. If you're going to contrast CISC and RISC, ARM belongs under CISC and RISC-V under RISC. I'd rather see RISC-V assembly instead of x86, though I understand why many x86-specific idioms are mentioned. RISC-V has a set-less-than instruction, so you don't need SUB, SLL, SUB to construct a mask; you can just do SLT, ADDI -1 (which could be macro-op fused to run in a single cycle). Similarly, RISC-V has no exception or flags state to taint, and its conditional branches are single instructions.
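To make the mask point concrete, here's the idiom at the C++ level (my sketch; the lowering described in the comments is what I'd expect a RISC-V compiler to emit, not something I've verified for every compiler):

```cpp
// Branchless min via a comparison mask. On RISC-V, (x < y) is a single
// slt and the "- 1" a single addi, so building the mask takes two
// instructions (potentially one after macro-op fusion) and touches no flags.
int branchless_min(int x, int y) {
    int mask = (x < y) - 1;            // 0 if x < y, all ones otherwise
    return (x & ~mask) | (y & mask);   // picks x when x < y, else y
}
```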

Two sections I feel are missing: macro-op fusion and out-of-order execution. RISC-V is a particularly good example of an ISA where macro-op fusion works very well. Out-of-order execution typically also makes register copies and renames "free", so they don't count towards execution-unit usage or cause data hazards.

I feel it might be worth noting the counters for cycles, instructions, and branches. The binary_gcd code runs more than double the instructions (106.5M vs. 47.6M) and almost 2 million additional branches, yet finishes 2.77x faster, taking about 100 million fewer cycles (0.29 vs. 1.78 instructions per cycle). This is the best example of "IPC matters" I have ever seen. :-)