Surprisingly slow constant time selection

Question

Surprisingly slow constant time selection

randombit opened this issue a year ago · 6 comments

I'm not sure if this is really an issue or not.

I wrote a dedicated algorithm for computing g*x+h*y where g and h are fixed, for creating/checking Pedersen commitments. I won't go into the details but the gist is that we perform 128 iterations of a loop, in each loop we select an element out of a table of 15 points, and then add the point we selected into an accumulator.

I was surprised to find that changing the constant time lookup to a simple variable time index caused performance to change from ~98 µs to under 50 µs. Which suggests that the cost of 15 constant time selects is ~= the cost of a point addition. This seems high?

I checked the generated asm (using cargo-show-asm) and the main thing that jumped out at me was that while Rust was inlining the points conditional_select, it was not inlining the conditional_select on the field:

        call qword ptr [rip + <p256::arithmetic::field::FieldElement as subtle::ConditionallySelectable>::conditional_select@GOTPCREL]

(there are 3 calls in each loop for x,y,z)

My actual code is using some wrapper types around k256/p256 (which all seem to be getting inlined as expected), but it basically the lookup looks like this:

    // returns identity if index == 0, otherwise from[index-1]
    fn ct_select(from: &[Point], index: usize) -> Point {
        use subtle::ConstantTimeEq;
        let mut val = identity_element();

        let index = index.wrapping_sub(1);
        for v in 0..from.len() {
            val.conditional_assign(&from[v], usize::ct_eq(&v, &index))?;
        }

        val
    }

I was wondering if the conditional select being so slow / not being inlined is expected, or if there is anything I can do to improve the situation.

Answer 1 · 2023-10-13T16:55:12.000Z

What version of p256 are you using?

Have you tried master? It includes #885 (unreleased) which might have effects on performance.

Otherwise there are quite a few different crates involved here: p256, primeorder, and crypto-bigint, and inlining needs to happen across all of them.

If you can use [patch.crates-io] in Cargo.toml to point to local copies of these crates, perhaps you could try adding #[inline] or #[inine(always)] attributes and see if that improves inlining.

Here is a good candidate, for example: https://github.com/RustCrypto/elliptic-curves/blob/366f4f3/p256/src/arithmetic/field.rs#L476

Answer 2 · 2023-10-13T18:18:55.000Z

This is p256 0.13.2 and seeing similar results (almost identical tbh) with k256 0.13. I will try master and then some inline annotations, thanks for the suggestions.

Answer 3 · 2023-10-13T18:54:45.000Z

We've definitely run into these sorts of cross-crate inlining problems before.

I'm also curious if LTO might help?

Answer 4 · 2023-10-18T17:37:52.000Z

As far as I can tell #885 doesn't have an effect either way on the performance I'm seeing.

Using #[inline(always)] on the various implementations of conditional_select in p256 and k256 helped significantly. The performance of my multiplication algo changed:

Using p256 was 97 µs became 75 µs with inline annotations
Using k256 was 99 µs became 84 µs with inline annotations

Answer 5 · 2023-10-18T17:42:48.000Z

Cool, feel free to open a PR.

I'd be curious to know if similar annotations in crypto-bigint and primeorder would help. It still seems slow to me.

Answer 6 · 2023-10-19T18:43:14.000Z

At this point the difference in performance I'm getting for variable time lookup vs a constant time look up is about 25 µs, and we are performing 15 x 128 selections total, which works out to ~13 ns per selection or ~~ 50-60 cycles. And this is for projective coordinates so we must unavoidably read 2x3x256 bits and then write 3x256 bits, plus masking. And the asm that I'm seeing from rustc looks quite reasonable - the mask is expanded into xmm register (once) and then used to combine with the limbs of the 3 field elements using pandn/por.

I'll close this since I don't think there is anything else that can be done within p256/k256 to improve things - the only remaining opportunities that seem likely are using affine coordinates (which would help greatly since we could also then use the mixed addition) and enabling AVX2 to allow for wider memory moves during element selection, both of which are on me.

Thanks for the suggestions and the fast merge.