Optimize `mul_wide()` (i.e. Karatsuba and Friends)
ycscaly opened this issue · 14 comments
We currently use the inefficient schoolbook multiplication for `mul_wide()`. It would be much better to use an optimized algorithm, e.g. Karatsuba. There's already a TODO left in the source code to do so.

This is important for our purposes, as optimizing `mul_wide()` will optimize `mul_montgomery_form()` in `DynResidue` and `Residue`, which is a bottleneck for `pow()`, which is, in turn, a bottleneck for our system. In fact, `mul_montgomery_form()` is just a `mul_wide()` followed by a `montgomery_reduction()`.
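For reference, the saving Karatsuba offers is replacing four half-size multiplications with three (plus a few additions). Here is a minimal, self-contained illustration of a single Karatsuba level at the word level, splitting a `u64` into 32-bit halves; the function is purely illustrative and not part of the crate:

```rust
/// One Karatsuba level: multiply two 64-bit values via three 32x32-bit
/// multiplications instead of four, using the identity
///   x*y = z2*2^64 + z1*2^32 + z0
/// where z0 = x0*y0, z2 = x1*y1, z1 = (x0 + x1)*(y0 + y1) - z0 - z2.
fn karatsuba_u64(x: u64, y: u64) -> u128 {
    let (x0, x1) = (x & 0xFFFF_FFFF, x >> 32); // low/high 32-bit halves
    let (y0, y1) = (y & 0xFFFF_FFFF, y >> 32);

    let z0 = (x0 as u128) * (y0 as u128); // low product
    let z2 = (x1 as u128) * (y1 as u128); // high product
    let mixed = ((x0 + x1) as u128) * ((y0 + y1) as u128); // the single "mixed" product
    let z1 = mixed - z0 - z2; // equals x0*y1 + x1*y0

    (z2 << 64) + (z1 << 32) + z0
}

fn main() {
    let (x, y) = (0xDEAD_BEEF_1234_5678u64, 0xCAFE_BABE_8765_4321u64);
    assert_eq!(karatsuba_u64(x, y), (x as u128) * (y as u128));
    println!("ok");
}
```

Applied recursively on the `Limb` halves of a `Uint`, the same identity yields roughly `n^1.585` limb multiplications instead of schoolbook's `n^2`.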
I evaluated this experimentally, and the benchmark results showed that the running times of `mul_wide()` and `montgomery_reduction()` are roughly equal, at about 4 µs each:
```
wrapping ops/mul, U4096*U4096
        time:   [4.2064 µs 4.2192 µs 4.2324 µs]
wrapping ops/rem, U4096/U4096
        time:   [555.09 ns 570.83 ns 589.77 ns]
wrapping ops/rem, U8192/U4096, full size
        time:   [1.7831 ms 1.7876 ms 1.7934 ms]
Montgomery arithmetic/DynResidue retrieve
        time:   [4.1380 µs 4.1579 µs 4.1820 µs]
Montgomery arithmetic/multiplication, U4096*U4096
        time:   [8.4631 µs 8.5012 µs 8.5433 µs]
Montgomery arithmetic/modpow, U4096^U4096
        time:   [35.566 ms 36.803 ms 38.838 ms]
```
From our calculation, Karatsuba improves by a factor of 12.62x. This would leave the main bottleneck for `mul_montgomery_form()` to be `montgomery_reduction()`, achieving a 2x improvement. However, it also seems that it would actually be cheaper to perform a `mul_wide()` followed by a `rem` over the integers, which is 8x cheaper than `montgomery_reduction()`; that could leave `rem` as the bottleneck, achieving roughly an 8x improvement of `mul_montgomery_form()`! This should translate directly to a similar improvement in `pow_montgomery_form()`, which is very significant. (Actually, from my benchmarking, a 4096-bit mod-pow costs 35ms in crypto-bigint now and 13ms in `rust-gmp`, so either I'm missing something or we can out-achieve them by 3x.)
Another issue is that Karatsuba is not constant time, but it can be made constant time by using the lazy interpolation method. My colleague Yuval Spiizer (@ylkylk1) in fact wrote his thesis on this matter, and in his paper (submitted and under review, not yet published) he improved upon the state of the art asymptotically by about 10% by modifying the Toom-Cook algorithm (which is a generalization of Karatsuba); his algorithm is constant time as well.
We can therefore help with this process.
See also: #67
Is there a reason it wasn't completed/merged? I see activity stopped a year ago. We would like to help move this forward actually.
The author of #67 (@pdogr) became unresponsive. The PR is also out of date and in need of a rebase, and contained undesirable breaking changes which have open comment threads.
I am willing to pick up that work if you'll have time to review it. I'll have a look at what he's done, and either continue the effort or start from scratch.
Sounds good
Diving into this again, we have a few insights:
1. Karatsuba is best for `mul_wide`'s purposes. This holds once we take into account that basic operations are on `Limb`s, which are 64-bit, rather than on bits. Since the `Uint`s used are small multiples of a `Limb` (i.e. in the hundreds and not thousands or more), Karatsuba (i.e. setting k = 2 for Toom-Cook) is optimal. (This means that the optimization we discussed in the newly released article by my colleague is irrelevant here.)

2. Our calculation was wrong for the same reason: we initially counted base operations in bits rather than in `Limb`s, which led us to multiplying big-ints composed of thousands of limbs (e.g. `U4096` being 4096 bits), but once we account for `Limb` size we are down to singles or dozens (e.g. `U4096` being 64 `Limb`s). This means that the optimization is much smaller than we originally thought (e.g. ~2x improvement for 4096-bit multiplication; see the back-of-the-envelope sketch after this list).
   This also de-motivates us a little, since the improvement is not as big as we originally thought, so this goes back to our backlog; but perhaps we could find time to do this in the next few months.

   Implementing this should now arrive at results on par with `rust-gmp` instead of exceeding it, which makes sense (again, due to our error in calculation).
3. There still is an open question around `Residue`, or Montgomery form in general: the fact that `rem` is (an order of magnitude) faster than `montgomery_reduction` and, as far as I understand it, could replace it without changing correctness, still remains unexplained.

   As I said before, `mul_montgomery_form` does a `mul_wide` followed by a `montgomery_reduction`. If instead we did a `mul_wide` followed by a `rem`, we could gain a ~2x factor very easily. Why not do it? And what are the benefits of using Montgomery in the first place, if working over the integers is faster? I must be missing something, but thought it would be better to ask it here.
> And what are the benefits of using Montgomery in the first place, if working over the integers is faster?

Montgomery form is typically useful when performing many operations which are faster in Montgomery form before converting back into canonical form, so the overhead of the conversions in/out of Montgomery form is outweighed by the savings.

There are often better algorithms if you're performing a single operation.
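To make the "pay the conversion once, reuse it many times" point concrete, here is a toy, self-contained single-word Montgomery implementation (R = 2^64). It is only a sketch for intuition; the struct and helper names are invented for this example and are not crypto-bigint's API:

```rust
/// Toy single-word Montgomery arithmetic (R = 2^64): the conversions into and
/// out of Montgomery form are paid once, while every multiplication in
/// between uses the cheap REDC reduction. crypto-bigint does the same thing
/// at multi-limb scale; none of these names are its internals.
struct Mont {
    n: u64,         // odd modulus, assumed < 2^63 here to keep REDC simple
    n_neg_inv: u64, // -n^{-1} mod 2^64
    r2: u64,        // R^2 mod n, used to convert into Montgomery form
}

impl Mont {
    fn new(n: u64) -> Self {
        assert!(n % 2 == 1 && n < (1 << 63));
        // Hensel/Newton lifting: n * inv == 1 (mod 2^64) after 6 doublings.
        let mut inv: u64 = 1;
        for _ in 0..6 {
            inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
        }
        let r = (1u128 << 64) % n as u128;
        let r2 = (r * r % n as u128) as u64;
        Self { n, n_neg_inv: inv.wrapping_neg(), r2 }
    }

    /// Montgomery reduction: returns t * R^{-1} mod n, for t < n * R.
    fn redc(&self, t: u128) -> u64 {
        let m = (t as u64).wrapping_mul(self.n_neg_inv);
        let s = ((t + m as u128 * self.n as u128) >> 64) as u64;
        if s >= self.n { s - self.n } else { s }
    }

    fn to_mont(&self, a: u64) -> u64 { self.redc(a as u128 * self.r2 as u128) }
    fn from_mont(&self, a: u64) -> u64 { self.redc(a as u128) }
    /// The analogue of `mul_montgomery_form`: a wide multiply then a reduction.
    fn mont_mul(&self, a: u64, b: u64) -> u64 { self.redc(a as u128 * b as u128) }
}

fn main() {
    let m = Mont::new(0x7fff_ffff_ffff_ffe7); // some odd modulus < 2^63
    let (a, b, c) = (123_456_789u64, 987_654_321u64, 555_555_555u64);

    // Convert in once, multiply repeatedly in Montgomery form, convert out once.
    let (am, bm, cm) = (m.to_mont(a), m.to_mont(b), m.to_mont(c));
    let result = m.from_mont(m.mont_mul(m.mont_mul(am, bm), cm));

    // Check against straightforward modular arithmetic.
    let expected = ((a as u128 * b as u128 % m.n as u128) * c as u128 % m.n as u128) as u64;
    assert_eq!(result, expected);
    println!("ok: {result}");
}
```

In a `pow` over a 4096-bit exponent, the loop body consists almost entirely of such `mont_mul` calls, so the two conversions are amortized over thousands of multiplications; the per-multiplication Montgomery reduction itself is not avoided.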
> And what are the benefits of using Montgomery in the first place, if working over the integers is faster?
>
> Montgomery form is typically useful when performing many operations which are faster in Montgomery form before converting back into canonical form, so the overhead of the conversions in/out of Montgomery form is outweighed by the savings.
>
> There are often better algorithms if you're performing a single operation.
I see the logic, it's just that in practice (at least for multiplication) that's not the case, as there is a Montgomery reduction happening in every operation in the current code.
I think that's unavoidable without lazy reductions?
> And what are the benefits of using Montgomery in the first place, if working over the integers is faster?
>
> Montgomery form is typically useful when performing many operations which are faster in Montgomery form before converting back into canonical form, so the overhead of the conversions in/out of Montgomery form is outweighed by the savings.
>
> There are often better algorithms if you're performing a single operation.
>
> I see the logic, it's just that in practice (at least for multiplication) that's not the case, as there is a Montgomery reduction happening in every operation in the current code.
After rewriting and running my benchmarks, I of course found out that you are right, and the error was on my side. Sorry for that.
In that case, it seems that in order to optimize `mul_montgomery_form()` we are left solely with implementing Karatsuba, from which we can expect a 25% improvement based on our previous calculation (a 2x improvement of `mul_wide()`, which is about half of the time `mul_montgomery_form()` takes).
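For concreteness, and assuming the earlier benchmark numbers still roughly hold: `mul_wide()` is ~4.2 µs and the full Montgomery multiplication is ~8.5 µs, so the reduction accounts for the remaining ~4.3 µs; halving `mul_wide()` gives roughly 2.1 µs + 4.3 µs ≈ 6.4 µs, i.e. about a 25% reduction overall.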
I should probably note we do have a Barrett reduction implementation here, but it's specialized to P-256's limb count: https://github.com/RustCrypto/elliptic-curves/blob/d1ea1a3/p256/src/arithmetic/scalar/scalar64.rs
What's the current status of this? Is there already a working implementation of Karatsuba multiplication or does that still need to be written?