Possible optimization for secp256k1

Question

Possible optimization for secp256k1

sipa opened this issue a year ago · 10 comments

The current C code for the 5x52 field implementation in libsecp256k1 has (at least) this optimization which seems not present in the fiat-crypto output (it's also absent in libsecp256k1's asm code, which probably explains how it was missed): bitcoin-core/secp256k1#810

I'm really happy with how close the fiat-crypto generated code seems to be getting, and we're seriously contemplating incorporating it in the library, but it'd be an easier sell if it has at least all the same high-level optimizations.

Answer 1 · 2023-04-20T02:36:26.000Z

Hi,
just wanted to gauge how complex this would be to implement in fiat-crypto.

As far as I can see from the changes in C-code, it replaces an

c & M * R  // producing AND and MULX
c>>52        // producing SHR 
...
c * R          // 128*64 -> 2 MULX + ADDs (or maybe 2 ADDs)

by

R*c                  // MULX
c>>64              // free
(R<<12) * c     // SHL and MULX

This happens twice in each function.

I think this is genius!

I've dug a little through the fiat-code and only found this Dettman-related file and in it I could only identify lines 82 -- 99, which seem like they specify the multiplication, but I am too unfamiliar with what is actually happening.
I'm more than happy to help with this wherever I can, but I'd need some help from the experts @OwenConoly, @JasonGross, @andres-erbsen .

Answer 2 · 2023-04-20T03:35:15.000Z

it's also absent in libsecp256k1's asm code, which probably explains how it was missed

Yeah, I referenced the asm code when I wrote fiat-crypto's template, so that would explain it.

My thought is that this change would not be very difficult to implement in fiat-crypto. It looks just like the same sort of operations that we already have. I haven't yet had time to look at this in detail, though.

Answer 3 · 2023-04-30T22:51:11.000Z

Hi @OwenConoly , did you have time already?
Would it help, if I would rewrite the assembly incorporating the optimization?

Answer 4 · 2023-05-01T06:03:01.000Z

Hi! Sorry for the delay, this part of the semester has been a bit chaotic for me. I've actually generated the new code already - you can see it on the dettman_avoid_wide_mul branch of fiat-crypto, in the file secp256k1_dettman_64.c.

Disclaimer: this C code is not yet formally verified, and I haven't yet tested it either. I have to write a few proofs still and refactor some things, which is why I haven't created a pull request for this yet.

Answer 5 · 2023-05-01T06:11:22.000Z

@sipa, @dderjoel - do these changes look like what you would expect to see?

Answer 6 · 2023-05-01T07:05:15.000Z

@sipa, @dderjoel - do these changes look like what you would expect to see?

looks like it to me. If you could generate the JSON, I could start some optimization runs and we could see if that gives us more performance :)

Answer 7 · 2023-05-02T00:51:35.000Z

Here's the JSON:
https://github.com/mit-plv/fiat-crypto/blob/69a2d4bc867f396b9002d4d57c7c87d169bfd2dd/fiat-json/src/secp256k1_dettman_64.json

I've finished proofs of the mul and square operations. The only remaining obstacle to merging is that my code is a bit of a mess.

Answer 8 · 2023-05-02T07:31:49.000Z

Thank you very much, Owen, for engaging here.

I was able to compile from this commit there, and generate and prove some code locally. I'll run on our machines now and report once I've got benchmarks.

Answer 9 · 2023-05-03T00:40:40.000Z

The numbers are looking very good! Have a look at bitcoin-core/secp256k1#1261 if you like. The gist is, that Fiat-C is 2.3% faster (on scmuls) than the existing C and with CryptOpt we can be 6.2% faster.

Answer 10 · 2023-05-20T16:44:37.000Z

Resolved by #1601