Compare/incorporate existing Montgomery modmul implementations
unzvfu opened this issue · 2 comments
There are a few existing open-source Montgomery modmul implementations that might (i) form a better basis for our improvements in Plonky, or (ii) at least suggest improvements we haven't thought of. We are initially interested in Rust implementations; assembly implementations will come later.
Some implementations to consider:
👍. I originally did interleaved Montgomery multiplication because it seemed popular in the literature (e.g. here), but that might have been a mistake, since separate mul and reduce methods would make it easier to optimize squaring and skip certain reductions.
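To make the "separate mul and reduce" shape concrete, here is a minimal single-limb sketch. It is not Plonky's code: the modulus `N` and the names `neg_inv`, `redc`, `to_mont`, `from_mont`, `mont_mul` and `mont_sqr` are all made up for illustration, and `N` is assumed odd and below 2^63 so the intermediate `u128` sum cannot overflow.

```rust
// Hypothetical modulus for illustration: odd, and < 2^63 so that
// t + m*N in `redc` always fits in a u128.
const N: u64 = (1 << 61) - 1;

/// Returns -N^{-1} mod 2^64, computed with Newton iteration.
fn neg_inv(n: u64) -> u64 {
    // For odd n, n*n ≡ 1 (mod 8), so x = n is an inverse correct to 3 bits.
    let mut x = n;
    // Each step doubles the number of correct bits: 3, 6, 12, 24, 48, 96.
    for _ in 0..5 {
        x = x.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(x)));
    }
    x.wrapping_neg()
}

/// REDC: for t < N * 2^64, returns t * 2^{-64} mod N.
fn redc(t: u128, n_prime: u64) -> u64 {
    // m is chosen so that t + m*N is divisible by 2^64.
    let m = (t as u64).wrapping_mul(n_prime);
    let u = ((t + (m as u128) * (N as u128)) >> 64) as u64;
    // u < 2N, so one conditional subtraction is enough.
    if u >= N { u - N } else { u }
}

/// Converts a < N into Montgomery form (a * 2^64 mod N).
fn to_mont(a: u64) -> u64 {
    (((a as u128) << 64) % (N as u128)) as u64
}

/// Converts back out of Montgomery form.
fn from_mont(a: u64, n_prime: u64) -> u64 {
    redc(a as u128, n_prime)
}

/// Multiply step followed by a separate reduce step.
fn mont_mul(a: u64, b: u64, n_prime: u64) -> u64 {
    redc((a as u128) * (b as u128), n_prime)
}

/// Squaring reuses the same REDC. With several limbs, the wide product
/// here is where a dedicated squaring routine would be swapped in.
fn mont_sqr(a: u64, n_prime: u64) -> u64 {
    redc((a as u128) * (a as u128), n_prime)
}
```

With a single limb there is nothing to interleave, so this only shows the structure: the multiply (or square) and the reduction are independent pieces, and either one can be replaced without touching the other.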
Another option (not that we need to study them all) is curve25519-dalek's [u64; 5] backend. It seems relatively fast, and they also have a SIMD backend.
Oh, I had forgotten about curve25519-dalek, thanks for reminding me! Added to the list. They presumably use the fast reduction for the modulus 2^255-19, which will be relevant wrt #71.
Generally the interleaved Monty modmul is a bit faster than mult+REDC when you're in the range where 'schoolbook multiplication' is fastest. Definitely not a mistake! If you have bigger numbers, ones where you'd want to use Karatsuba or FFT multiplication (probably not relevant for us), or if you want to take advantage of fast squaring (as we do), then mult/sqr + REDC is the way to go. The question for us will be whether the 'fast squaring' trick is actually faster with the fairly small numbers we're dealing with. [Fast squaring can work with CIOS; see updated comments on #70.]
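For reference, the multiply side of the "fast squaring" trick could look roughly like the sketch below. It assumes 4 x 64-bit little-endian limbs, which may not match Plonky's layout, and `mul_wide` / `sqr_wide` are hypothetical names. The squaring computes each cross product a[i]*a[j] (i < j) once, doubles the sum, then adds the diagonal squares, so it needs 10 limb multiplications instead of 16.

```rust
/// Schoolbook 4x4-limb multiplication: 16 limb products.
fn mul_wide(a: &[u64; 4], b: &[u64; 4]) -> [u64; 8] {
    let mut r = [0u64; 8];
    for i in 0..4 {
        let mut carry = 0u128;
        for j in 0..4 {
            // Cannot overflow: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.
            let t = (a[i] as u128) * (b[j] as u128) + (r[i + j] as u128) + carry;
            r[i + j] = t as u64;
            carry = t >> 64;
        }
        r[i + 4] = carry as u64;
    }
    r
}

/// Squaring with each off-diagonal product computed once: 6 + 4 limb products.
fn sqr_wide(a: &[u64; 4]) -> [u64; 8] {
    let mut r = [0u64; 8];

    // 1. Sum of the off-diagonal products a[i]*a[j] for i < j.
    for i in 0..4 {
        let mut carry = 0u128;
        for j in (i + 1)..4 {
            let t = (a[i] as u128) * (a[j] as u128) + (r[i + j] as u128) + carry;
            r[i + j] = t as u64;
            carry = t >> 64;
        }
        r[i + 4] = carry as u64;
    }

    // 2. Double it (shift left by one bit across all limbs).
    let mut top = 0u64;
    for limb in r.iter_mut() {
        let next = *limb >> 63;
        *limb = (*limb << 1) | top;
        top = next;
    }

    // 3. Add the diagonal terms a[i]^2 at limb position 2*i.
    let mut carry = 0u128;
    for i in 0..4 {
        let sq = (a[i] as u128) * (a[i] as u128);
        let lo = (r[2 * i] as u128) + (sq as u64 as u128) + carry;
        r[2 * i] = lo as u64;
        let hi = (r[2 * i + 1] as u128) + (sq >> 64) + (lo >> 64);
        r[2 * i + 1] = hi as u64;
        carry = hi >> 64;
    }
    r
}
```

Whether the saved multiplications outweigh the extra shift and carry handling at this size is exactly the open question above; this sketch only shows what would be benchmarked.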
Also worth noting that fast reduction mod 2^b + c doesn't (necessarily) need Monty representation at all.
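A small sketch of that point, using plain (non-Montgomery) residues. The modulus 2^62 - 57 (so c = -57 in the notation above) is an arbitrary single-limb choice for illustration, not one taken from this thread, and `reduce` / `mul_mod` are hypothetical names. The idea is that 2^62 ≡ 57 (mod P), so the high part of a product can be folded back in with one small multiplication.

```rust
const B: u32 = 62;
const C: u64 = 57;
// Illustrative modulus of the form 2^b - c.
const P: u64 = (1u64 << B) - C;
const MASK: u128 = (1u128 << B) - 1;

/// Reduces any u128 modulo P using 2^B ≡ C (mod P); no division needed.
fn reduce(x: u128) -> u64 {
    // First fold: x = hi * 2^B + lo ≡ hi * C + lo. Result is below 2^72.
    let y = (x >> B) * (C as u128) + (x & MASK);
    // Second fold: the high part is now below 2^10, so z < 2P.
    let z = ((y >> B) * (C as u128) + (y & MASK)) as u64;
    // One conditional subtraction brings z into [0, P).
    if z >= P { z - P } else { z }
}

/// Modular multiplication on ordinary residues a, b < P.
fn mul_mod(a: u64, b: u64) -> u64 {
    reduce((a as u128) * (b as u128))
}
```

The same folding idea extends to multi-limb moduli such as 2^255-19, with more bookkeeping for carries between limbs.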