Surprisingly slow constant time selection
randombit opened this issue · 6 comments
I'm not sure if this is really an issue or not.
I wrote a dedicated algorithm for computing g*x+h*y
where g
and h
are fixed, for creating/checking Pedersen commitments. I won't go into the details but the gist is that we perform 128 iterations of a loop, in each loop we select an element out of a table of 15 points, and then add the point we selected into an accumulator.
I was surprised to find that changing the constant time lookup to a simple variable time index caused performance to change from ~98 µs to under 50 µs. Which suggests that the cost of 15 constant time selects is ~= the cost of a point addition. This seems high?
I checked the generated asm (using cargo-show-asm
) and the main thing that jumped out at me was that while Rust was inlining the points conditional_select
, it was not inlining the conditional_select
on the field:
call qword ptr [rip + <p256::arithmetic::field::FieldElement as subtle::ConditionallySelectable>::conditional_select@GOTPCREL]
(there are 3 calls in each loop for x,y,z)
My actual code is using some wrapper types around k256
/p256
(which all seem to be getting inlined as expected), but it basically the lookup looks like this:
// returns identity if index == 0, otherwise from[index-1]
fn ct_select(from: &[Point], index: usize) -> Point {
use subtle::ConstantTimeEq;
let mut val = identity_element();
let index = index.wrapping_sub(1);
for v in 0..from.len() {
val.conditional_assign(&from[v], usize::ct_eq(&v, &index))?;
}
val
}
I was wondering if the conditional select being so slow / not being inlined is expected, or if there is anything I can do to improve the situation.
What version of p256
are you using?
Have you tried master
? It includes #885 (unreleased) which might have effects on performance.
Otherwise there are quite a few different crates involved here: p256
, primeorder
, and crypto-bigint
, and inlining needs to happen across all of them.
If you can use [patch.crates-io]
in Cargo.toml to point to local copies of these crates, perhaps you could try adding #[inline]
or #[inine(always)]
attributes and see if that improves inlining.
Here is a good candidate, for example: https://github.com/RustCrypto/elliptic-curves/blob/366f4f3/p256/src/arithmetic/field.rs#L476
This is p256 0.13.2
and seeing similar results (almost identical tbh) with k256 0.13.
I will try master
and then some inline annotations, thanks for the suggestions.
We've definitely run into these sorts of cross-crate inlining problems before.
I'm also curious if LTO might help?
As far as I can tell #885 doesn't have an effect either way on the performance I'm seeing.
Using #[inline(always)]
on the various implementations of conditional_select
in p256
and k256
helped significantly. The performance of my multiplication algo changed:
Using p256
was 97 µs became 75 µs with inline annotations
Using k256
was 99 µs became 84 µs with inline annotations
Cool, feel free to open a PR.
I'd be curious to know if similar annotations in crypto-bigint
and primeorder
would help. It still seems slow to me.
At this point the difference in performance I'm getting for variable time lookup vs a constant time look up is about 25 µs, and we are performing 15 x 128 selections total, which works out to ~13 ns per selection or ~~ 50-60 cycles. And this is for projective coordinates so we must unavoidably read 2x3x256 bits and then write 3x256 bits, plus masking. And the asm that I'm seeing from rustc
looks quite reasonable - the mask is expanded into xmm
register (once) and then used to combine with the limbs of the 3 field elements using pandn
/por
.
I'll close this since I don't think there is anything else that can be done within p256
/k256
to improve things - the only remaining opportunities that seem likely are using affine coordinates (which would help greatly since we could also then use the mixed addition) and enabling AVX2 to allow for wider memory moves during element selection, both of which are on me.
Thanks for the suggestions and the fast merge.