Optimize Bernstein-Yang for 32-bit targets
tarcieri opened this issue · 0 comments
Bernstein-Yang as described in the original paper uses 62-bit limbs and is optimized for 64-bit targets: https://eprint.iacr.org/2019/266
Section 12.3 of the paper suggests it can be better optimized for 32-bit targets:
12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For
example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7,
obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we
have found in the literature is 62648 cycles reported by Fujii and Aranha [34], using
Fermat’s method. Internally, our software does repeated calls tojump32divsteps2
, units
of 30 iterations with the resulting transition matrix fitting inside 32-bit general-purpose
registers.
However, I wasn't able to find more information about adapting the algorithm to 32-bit targets beyond this, nor was I able to find any more information about jump32divsteps2
.
I attempted to naively translate the implementation to using 30-bit limbs as suggested by the "units of 30 iterations" in #372 but was unable to get it to work.