RustCrypto/crypto-bigint

Optimize `mul_wide()` (i.e. Karatsuba and Friends)

ycscaly opened this issue · 14 comments

We currently use the inefficient schoolbook multiplication for mul_wide. It would be much better to use an optimized algorithm, e.g. Karatsuba. There's already a TODO left in the source code to do so.

This is important for our purposes as optimizing mul_wide() will optimize mul_montgomery_form() in DynResidue and Residue, which is a bottleneck for pow() which is, in turn, a bottleneck for our system. In fact, mul_montgomery_form() is just a mul_wide() followed by a montgomery_reduction().

I evaluated experimentally and benchmark results showed that running time for mul_wide() and montgomery_reduction() are equal at roughly 4µs, :

wrapping ops/mul, U4096*U4096
                        time:   [4.2064 µs 4.2192 µs 4.2324 µs]                   
wrapping ops/rem, U4096/U4096
                        time:   [555.09 ns 570.83 ns 589.77 ns]
wrapping ops/rem, U8192/U4096, full size
                        time:   [1.7831 ms 1.7876 ms 1.7934 ms]
Montgomery arithmetic/DynResidue retrieve
                        time:   [4.1380 µs 4.1579 µs 4.1820 µs]
Montgomery arithmetic/multiplication, U4096*U4096
                        time:   [8.4631 µs 8.5012 µs 8.5433 µs]
Montgomery arithmetic/modpow, U4096^U4096
                        time:   [35.566 ms 36.803 ms 38.838 ms]

From our calculation, Karatsuba improves by a factor of 12.62x. This would leave the main bottleneck for mul_montgomery_form() to be montgomery_reduction(), achieving a 2x improvement. However, it also seems that it would actually be cheaper to perform a mul_wide() followed by an rem - which is 8x cheaper than montgomery_reduction() - over the integers, which could actually leave rem as the bottleneck, achieving roughly 8x improvement of mul_montgomery_form()! This should translate directly to a similar improvement in pow_montgomery_form(), which is very significant. (Actually, from my benchmarking, a 4096-bit mod-pow costs 35ms in crypto-bigint now, and 13ms in rust-gmp, so either I'm missing something or we can out-achieve them by 3x.)

Another issue is that Karatsuba is not constant time, but it can be made constant time by using the lazy interpolation method. My colleague Yuval Spiizer (@ylkylk1) has in fact did his thesis on this matter, and in his paper (submitted and under review, not yet published) he improved upon the state-of-the-art by a 10% change asymptomatically by modifying Toom Cook's algorithm (which is a generalization of Karatsuba) - and his algorithm is constant time as well.

We can therefore help with this process.

See also: #67

See also: #67

Is there a reason it wasn't completed/merged? I see activity stopped a year ago. We would like to help move this forward actually.

The author of #67 (@pdogr) became unresponsive. The PR is also out of date and in need of a rebase, and contained undesirable breaking changes which have open comment threads

The author of #67 (@pdogr) became unresponsive. The PR is also out of date and in need of a rebase, and contained undesirable breaking changes which have open comment threads

I am willing to pick up that work if you'll have time reviewing it. I'll have a look at what he's done, and either continue the effort or start from scratch.

Sounds good

Diving into this again, we have a few insights:

  1. Karatsuba is best for mul_wide's purposes. This is true taking into account that basic operations are on Limb which is 64-bit and not on bits. Since the Uints used are small factors of Limb (i.e. in the hundreds and not thousands or more) Karatsuba (i.e. setting k = 2 for Toom-Cook) is optimal. (This means that the optimization we discussed in the newly released article by my colleague is irrelevant here.)
  2. Our calculation was wrong for the same reason - we initially calculated base-operations as bits and not Limb size, which got us to multiplying big-ints composed of thousands of limbs (e.g. U4096 being 4096 bits) but when we accounted Limb we got to singles or dozens (e.g. U4096 being 64 Limbs). This means that the optimization is much smaller than we have originally thought (e.g. ~2x improvement for 4096-bit multiplication.)

This also de-motivates us a little since the improvement is not as big as we originally thought, so this goes back to our backlog but perhaps we could find time to do this in the next few months.

Implementing this should now arrive at results on-par with rust-gmp instead of exceeding it, which makes sense (again, due to our error in calculation.)
3. There still is an open question around Residue or Montgomery in general; the fact that rem is (order of magnitude) faster then montgomery_reduction, and, as far as I understand it, could replace it changing correctness, still remains unexplained.

As I said before, mul_montgomery_form does a mul_wide followed by a montgomery_reduction. If instead we would do a mul_wide followed by a rem we could gain a ~2x factor very easily. Why not do it? And what are the benefits of using Montgomery in the first place, if working over the integers is faster? I must be missing something, but thought it would be better to ask it here.

And what are the benefits of using Montgomery in the first place, if working over the integers is faster?

Montgomery form is typically useful when performing many operations which are faster in Montgomery form before converting back into canonical form, so the overhead of the conversions in/out of Montgomery form are outweighed by the savings.

There are often better algorithms if you're performing a single operation.

And what are the benefits of using Montgomery in the first place, if working over the integers is faster?

Montgomery form is typically useful when performing many operations which are faster in Montgomery form before converting back into canonical form, so the overhead of the conversions in/out of Montgomery form are outweighed by the savings.

There are often better algorithms if you're performing a single operation.

I see the logic, it's just that in practice (at least for multiplication) that's not the case, as there is a Montgomery reduction happening in every operation in the current code.

I think that's unavoidable without lazy reductions?

And what are the benefits of using Montgomery in the first place, if working over the integers is faster?

Montgomery form is typically useful when performing many operations which are faster in Montgomery form before converting back into canonical form, so the overhead of the conversions in/out of Montgomery form are outweighed by the savings.
There are often better algorithms if you're performing a single operation.

I see the logic, it's just that in practice (at least for multiplication) that's not the case, as there is a Montgomery reduction happening in every operation in the current code.

After re-writing and running my benchmarking I of course found out you are right, and the error was on my side. Sorry for that.

In that case, it seems that in order to optimize mul_montogmery_form() we are left solely with implementing Karatsuba, which we can expect a 25% improvement from our previous calculation (2x improvement of mul_wide(), which is about half of the time mul_montgomery_form() takes.)

I should probably note we do have a Barrett reduction implementation here, but it's specialized to P-256's limb count: https://github.com/RustCrypto/elliptic-curves/blob/d1ea1a3/p256/src/arithmetic/scalar/scalar64.rs

HastD commented

What's the current status of this? Is there already a working implementation of Karatsuba multiplication or does that still need to be written?

#67 is the only PR currently open for this and it’s stale. Otherwise Karatsuba has not yet been implemented

As of #649 we now support Karatsuba. Closing.