Low-level big-integer arithmetic subroutines in C/C++/asm for modern Intel or AMD CPU.

mul8_zen() multiplies 512-bit (8-limb) numbers faster than GMP subroutine gmpn_mul(,, 8, , 8) (both on Skylake and Ryzen). Such hand-optimized assembler subroutines exist for sizes 4…11.

Templated subroutine toom22_broadwell_t<N> multiplies N-limb numbers faster than gmpn_mul(,, N, , N), for small values of N.

Status

Work-in-progress. Code needs cleaning. Some subroutines might not work as expected. However, if there is a benchmark published for a procedure, then this procedure is thoroughly tested and expected to be bug-free.

Currently my code outperforms GMP for at least the following limb sizes: 4…20, 22, 24, 32, 48, 64, 72, 127. I am using Skylake and Ryzen desktops to run benchmarks (CPUs: Core i7-6700 @3.40GHz and Ryzen 7 3800X 8-Core).

Quick start

python2 configure.py
ninja
automagic/test4.exe
automagic/benchm8_official.exe 0
ninja more_benchmarks
automagic/benchm8_basecase.exe 0
automagic/benchm8_toom22_t.exe 0

, where 0 is a randomness generator seed.

Benchmarks

subroutine	tacts
mpn_mul_n 8	143
gmpn_mul_basecase 8	133
mul8_aligned	104
mpn_mul_n 16	489
gmpn_mul_basecase 16	478
toom22_xx_broadwell 16	402
toom22_broadwell_t 16	392
mpn_mul_n 24	1086
toom22_broadwell_t 24	967
mpn_mul_n 32	1678
toom22_deg2_broadwell 32	1344

Left column contains subroutine name and size indicator. For instance, gmpn_mul_basecase 16 indicates that gmpn_mul_basecase(target, operand_a, 16, operand_b, 16) was called.

Right column contains count of CPU tacts (ticks), as returned by __rdtsc().

GMP benchmarks (..._basecase, mpn_mul_n) are for GMP version 6.1.2, GCC version 9.3.0.

For stable result, don't forget to turn off turbo boost and fix CPU frequency when benchmarking.

Requirements

To generate and compile code, Python2 interpreter, ninja utility, GCC C/C++ compiler on X86_64 Linux or compatible OS.
To benchmark code, Intel Broadwell or AMD Ryzen or better CPU (capable to adcx/adox).

License

For code, GNU Lesser General Public License version 3.

Questions and answers

Q0. I want to link against GMP library hand-installed at /opt/gmp.

A0. Supposing gmp static library is at /opt/gmp/lib64/libgmp.a, call configure.py like this:

c_compiler=/opt/bin/gcc-9.9 \
cpp_compiler=/opt/bin/g++-9.9 \
flags='-O9 -march=broadwell -static' \
gmp_location=/opt/gmp/lib64/libgmp.a \
/usr/bin/python2 \
configure.py

where /usr/bin/python2 is full path to your Python2 interpreter, flags are C and C++ compiler flags.

Q1. Can I use other compiler than GCC?

A1. Maybe, if it fully supports GCC-style inline asm. Oops, clang rejects to compile due to if constexpr problem.

Q2. I found a bug in your code.

A2. File bugreport. Include value of input parameters, received output and expected output. Prefererably in a form of compilable C or C++ program.

Q3. Can I use your code under OS other than Linux?

A3. If this OS uses the same calling convention (System V ABI for 64-bit Unix systems), then probably yes. If not, then some subroutines are unusable.

Q4. My benchmark of GMP code shows bigger tact numbers than found above in this README.

A4. Did you install GMP via packet manager? Probably it is not optimized for your Broadwell or Ryzen. Symptom: no mulx and/or adox instruction in disassembled code.

Q5. How do I know that installed GMP library is optimized for Broadwell or better CPU?

A5. Disassemble it, unpack source code

cd /tmp
objdump -d /usr/lib64/libgmp.a > gmp.asm
tar xf /distfiles/gmp-6.2.1.tar.xz

, open gmp.asm, look at __gmpn_mul_basecase subroutine. The code should be similar to source found in /tmp/portage/dev-libs/gmp-6.2.1/work/gmp-6.2.1/mpn/x86_64/coreibwl/mul_basecase.asm. If optimized for Zen2, then source code is in .../mpn/x86_64/zen1/ directory.

Q6. My packet manager installed crippled (not microarch-optimized) GMP. How do I build GMP optimized for my CPU?

A6. Under Gentoo Linux, use my ebuild script. If you are not using portage packet manager, follow official build instructions.

Q7. Why do you generate C/assembler code instead of writing it directly?

A7. Because it is a lot easier to write and read

mulx 8(up), w3, w4

than

" mulx 8(%[up]), %[w3], %[w4]\n"   \

. Note that GMP developers use a loop-optimizing software and m4 macro processor.

Q8. Why do you generate ninja.build instead of writing it directly?

A8. Because ninja is about speed, not user convenience. ninja.build is redundant and unreadable, compared to Makefile. It is hard and unpleasant to manually edit ninja.build.

Q9. How do I remove all generated code to start from scratch?