ckormanyos/wide-integer

Halt early in an n_by_n multiplication loop if the numbers are small enough

maksymbevza opened this issue · 16 comments

Hello,
I like the library very much, thanks a lot for building and actively supporting it.

I ran into the issue that when multiplying numbers of small magnitude that are stored in big types, the computation still runs over the full loop. Luckily it is not O(n^2) but rather O(n), because the outer loop checks each limb of the first operand for a zero value.

Say, for example, we compute 11111 * 22222 using a 512-bit unsigned type with 32-bit limbs. Only the very first limb is non-zero for both values. The outer loop ensures that if a limb value is 0 we don't run the inner loop. However, if it is not 0, the inner loop executes to the very end (to count - i, to be more precise, which helps).
More precisely, this line continues to execute: https://github.com/ckormanyos/wide-integer/blob/master/math/wide_integer/uintwide_t.h#L3785

The suggestion is to compute the number of significant (non-zero) limbs of the second number and bound the inner loop by that limit.

Another example of where this helps is when the first number is big (occupying all of the limbs) but the second one is small and takes just a single limb. Another way to address that particular case is to swap the operands so that the small number is multiplied by the large one.
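To make the suggestion concrete, here is a minimal, self-contained sketch of a schoolbook multiplication loop with the proposed bound. This is illustration code, not the library's actual routine, and the `count_significant_limbs()` helper is hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: index of the highest non-zero limb of x, plus one.
template<std::size_t N>
auto count_significant_limbs(const std::array<std::uint32_t, N>& x) -> std::size_t
{
  std::size_t n = N;

  while((n > 0U) && (x[n - 1U] == 0U)) { --n; }

  return n;
}

// Simplified schoolbook multiplication (mod 2^(32 * N)). The inner loop
// is bounded by the significant-limb count of v instead of always
// running to (N - i).
template<std::size_t N>
auto multiply(const std::array<std::uint32_t, N>& u,
              const std::array<std::uint32_t, N>& v) -> std::array<std::uint32_t, N>
{
  std::array<std::uint32_t, N> w { };

  const std::size_t nv = count_significant_limbs(v);

  for(std::size_t i = 0U; i < N; ++i)
  {
    if(u[i] == 0U) { continue; } // The existing outer-loop zero-limb skip.

    const std::size_t limit = ((nv < (N - i)) ? nv : (N - i));

    std::uint64_t carry = 0U;

    for(std::size_t j = 0U; j < limit; ++j)
    {
      const std::uint64_t t =
        (static_cast<std::uint64_t>(u[i]) * v[j]) + w[i + j] + carry;

      w[i + j] = static_cast<std::uint32_t>(t);
      carry    = (t >> 32U);
    }

    // Propagate any remaining carry into the limbs the shortened loop skipped.
    for(std::size_t k = (i + limit); ((carry != 0U) && (k < N)); ++k)
    {
      const std::uint64_t t = static_cast<std::uint64_t>(w[k]) + carry;

      w[k]  = static_cast<std::uint32_t>(t);
      carry = (t >> 32U);
    }
  }

  return w;
}
```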

@johnmcfarlane Indeed it can, thanks for pointing it out. We could guard it with a compile-time -D switch or similar. Also, I think it is already susceptible to this attack, due to the existing check of whether the first number's limb value is non-zero.

Thanks for your interest and these optimization suggestions. If I recall correctly, division counts the number of leading zero limbs as an optimization. So we can perhaps discuss this for multiplication as well. My first fear would be that it slows down all multiplications for the sake of a few selected operations that happen to use small numbers. But we could selectively enable/disable it as suggested. I'm traveling now and will look into this next week.

Hi Maksym (@maksymbevza) I've started looking into these kinds of optimizations in the clz_limb_optimizations_issue362 branch.

I'm considering using -DWIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS for these optimizations.

In other words:

#define WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS

I started off a full batch of CI with WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS defined. Local tests were OK on the first optimization I tried. Let's see how it does with full sanitizers and all of CI.

Hello, @ckormanyos. Thanks a lot.

I spent some more time trying to understand our biggest bottleneck and use case. It turns out the heaviest use comes from a function that multiplies two uint256 values a and b and divides by another uint256 value c. Our logic guarantees that the result of the division fits in a uint256, to which it converts back.

In the a*b/c formula I conditionally swapped a and b to ensure a < b, in order to reuse the library's existing optimization of skipping the whole inner loop when a limb of a is zero.

However, most of the time both a and b use 120-200 bits, so this is not the main optimization point. Luckily, since the multiplication is performed in uint512 to hold the full product, I know that the upper 50% of the limbs will always be empty for both a and b, which practically guarantees that the suggested optimization applies.
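For reference, a minimal sketch of that use-case, assuming the library's uint256_t/uint512_t convenience typedefs. The `mul_div()` name is hypothetical, and the caller is assumed to guarantee that the quotient fits back into uint256:

```cpp
#include <math/wide_integer/uintwide_t.h>

#include <utility>

using ::math::wide_integer::uint256_t;
using ::math::wide_integer::uint512_t;

// Compute (a * b) / c in 512 bits so the intermediate product cannot
// overflow. Swapping makes a the operand with (likely) fewer significant
// limbs, so the outer-loop zero-limb skip engages as early as possible.
auto mul_div(uint256_t a, uint256_t b, const uint256_t& c) -> uint256_t
{
  if(b < a) { std::swap(a, b); }

  const uint512_t product = (uint512_t(a) * uint512_t(b));

  return uint256_t(product / uint512_t(c));
}
```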

Also, I see that division can be yet another bottleneck, but I have not investigated it much yet.

--

Another thing I thought about is using uint512 with 64-bit limbs, unrolling the 8x8 multiplication, and using AVX2, which is already widely available on cloud platforms. I'm not sure whether the compiler is clever enough to automatically generate the relevant AVX2 assembly just from enabling the -march=haswell/-mavx2 flags, with no changes to the code itself.

By the way, on the topic of 64-bit limbs: I see that using 64-bit limbs hurts the performance of our function a bit, and it's not yet clear to me why. I suspect that the conversion from 32-bit limbs to 64-bit limbs and back costs more than it gains.

If you prefer, I can create another issue on AVX2 specifically to discuss it in more detail.

Hi Maksym (@maksymbevza) it's good to read of your progress.

I have now tested counting leading zero limbs. I actually think I can/will check that into the main branch soon, behind the compiler switch WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS.

For the 8-by-8 case, believe it or not, there is already an unrolled multiplication. To get this, you need to define the preprocessor switch WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL.

You can try the 8-by-8 switch and see if it helps your use-case.
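A usage sketch, assuming the unrolled path applies to operands with 8 limbs (e.g. 512 bits with 64-bit limbs); the exact width/limb combinations that take the unrolled path are best verified in uintwide_t.h:

```cpp
// Define the switch before including the header, or pass
// -DWIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL on the compiler command line.
#define WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL

#include <math/wide_integer/uintwide_t.h>

#include <cstdint>

// 512 bits stored as 8 limbs of 64 bits each.
using uint512_limb64 = ::math::wide_integer::uintwide_t<512U, std::uint64_t>;

auto mul(const uint512_limb64& a, const uint512_limb64& b) -> uint512_limb64
{
  return a * b;
}
```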

In general, I'm reluctant (verging on refusing) to implement architecture-specific assembly. I use this library on about 15-20 different architectures and I'm happy with its portability and overall performance. So it is unlikely that I will support an individual architecture's optimization any time soon. You can do this in your fork, and, especially with the unrolled 8-by-8 multiplication, you might see a simple way to put your own assembler in there.

I can create another issue on AVX2 specifically to discuss it...

Hi Maksym (@maksymbevza) I am happy to discuss this. As mentioned above, please try first WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL.

As mentioned above, however, I'm unlikely to support machine-specific details. But I could discuss with you how you might be able to get these in something like a fork.

Sure, totally understand your point on architectural assembly.

Yes, I've seen WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL and its implementation. I will do more benchmarking soon.

Thanks a lot for the help in optimizing this particular point; I will try to measure the actual performance gain.

@ckormanyos Could you please hint whether there's a quick way to convert integers between a 64-bit limb type and a 32-bit limb type? At the moment we use string-based conversion, which is obviously very slow. I've seen import_bits/export_bits, which could probably be used for this, but maybe there is another quick way?

Hi Maksym, in addition, there is a static member function called from_rep() that can be used directly for limb manipulations. But you would be responsible for shifting/halving/doubling the limbs yourself when converting between the representations.

Also, I must admit, from_rep() is not quite as well tested as some other parts of the library.

Regarding from_rep(): basically you get the representation of the source type, then manipulate its limbs manually. The manipulated limbs are then passed to a specific call of from_rep(), which acts like a factory that creates the new wide integer with the different limb width.

I have not (yet) and might never support mixed limb-width conversions.
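Based on that description, a rough sketch of a 64-bit-limb to 32-bit-limb conversion might look as follows. The representation_type and crepresentation() names and the exact from_rep() signature are assumptions here and should be checked against uintwide_t.h:

```cpp
#include <math/wide_integer/uintwide_t.h>

#include <cstddef>
#include <cstdint>

using uint256_limb64 = ::math::wide_integer::uintwide_t<256U, std::uint64_t>;
using uint256_limb32 = ::math::wide_integer::uintwide_t<256U, std::uint32_t>;

// Split each 64-bit limb into two 32-bit limbs (little-endian limb
// order assumed) and build the destination type via from_rep().
auto to_limbs32(const uint256_limb64& src) -> uint256_limb32
{
  uint256_limb32::representation_type rep32 { };

  const auto& rep64 = src.crepresentation();

  for(std::size_t i = 0U; i < rep64.size(); ++i)
  {
    rep32[(2U * i) + 0U] = static_cast<std::uint32_t>(rep64[i]);
    rep32[(2U * i) + 1U] = static_cast<std::uint32_t>(rep64[i] >> 32U);
  }

  return uint256_limb32::from_rep(rep32);
}
```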

Hi Maksym (@maksymbevza) the optimizations we've discussed so far and agreed to attempt, whether they turn out to be needed or not, are cycling for inclusion in master in #363.

As soon as the PR gets merged, you can use the compile-time definition WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS to activate these.

These optimizations are disabled by default, and the library has not been significantly changed in its default state.

Hi Maksym (@maksymbevza), the OP is preliminarily handled by #363, which is now merged to master.

I will, for the moment, keep this issue open. Moving forward, we can open secondary issue(s) for additional points discovered herein, as needed.

As soon as the PR gets merged, you can use the compile-time definition WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS to activate these.

The first round of optimizations is now in master.

Hello

The results of the speed-up are incredible, thanks a lot. The measurements below were taken with Google Benchmark.
All measurements use the uint512 type with 32-bit limbs. All test cases have only leading zero limbs, never non-leading zero limbs (e.g. 0x0000...123456, but not 0x000...000123000...000). Tested on an Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz.

-  90-bit by  90-bit :  37ns ->  20ns (+85% op/s)
- 256-bit by 256-bit :  66ns ->  58ns (+13% op/s)
- 512-bit by 512-bit :  86ns -> 104ns (-17% op/s)

In our case the operands are mostly between 90 and 256 bits, and never over 256 bits, which yields a considerable, statistically significant improvement in multiplication.

Thanks again, @ckormanyos, for both building the library and for building the effective optimization.

The results of the speed-up are incredible, thanks a lot.

Great! I am happy this helped. This change was a great experience in the domain of local, application-specific optimization. We reached, I believe, a good point.

Thanks Maksym (@maksymbevza) for your feedback and for your original, good ideas.

If anything else comes up or you have other ideas, please get in touch any time.

Kind regards, Chris

I'll close this issue now with: Fixed by 6a3777b