RustCrypto/crypto-bigint

Optimize Bernstein-Yang for 32-bit targets

tarcieri opened this issue · 0 comments

Bernstein-Yang as described in the original paper uses 62-bit limbs and is optimized for 64-bit targets: https://eprint.iacr.org/2019/266

Section 12.3 of the paper suggests it can be better optimized for 32-bit targets:

12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For
example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7,
obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we
have found in the literature is 62648 cycles reported by Fujii and Aranha [34], using
Fermat’s method. Internally, our software does repeated calls to jump32divsteps2, units
of 30 iterations with the resulting transition matrix fitting inside 32-bit general-purpose
registers.

However, I wasn't able to find more information about adapting the algorithm to 32-bit targets beyond this, nor was I able to find any more information about jump32divsteps2.

I attempted to naively translate the implementation to using 30-bit limbs as suggested by the "units of 30 iterations" in #372 but was unable to get it to work.