This repository contains demonstration of technique of batched small-field sumcheck using Frobenius twists. My mathematical notes are currently in disarray, but I will try to provide extensive commentary. I hope that following the code you will be able to figure the rest out, and article will be published when I catch my breath.
For now, you can check article stub here: https://hackmd.io/@levs57/SJ4fuZMD0
Clone repository, switch to nightly, and run
RUSTFLAGS="-Ctarget-cpu=native" RUST_BACKTRACE=1 cargo test main_protocol --release -- --nocapture
I have 2 backends - one using AVX-256 instructions, and another using Neon for ARM architecture. There is currently no soft backend, so in case your device does not support any of those, the code will be unsound (and will likely crash). You will need L1 Cache of size at least 128Kb, or the performance will be suboptimal.
I would like to thank
- My collaborator Olena @Fitznik from Matter Labs for help with various parts of this code and Neon backend
- My colleague Amir @aliventer who taught me how to use Linux Perf and check cache performance
- Shahar Papini @spapinistarkware for reviewing earlier version of this code
- Justin Thaler and Srinath Setty for various discussions about sumcheck
First, we present a technique comparable to Binius code switch unpacking. It is conceptually easier, allegedly somewhat faster, and it works over any field and in any basis (i.e. it does not require an extension tower).
Let us assume we are in extension
Now, assume that we have a packed polynomial
However, Frobenius action on
Opening polynomial in a Frobenius orbit is much cheaper than in a random set of points. For the reference, check multiclaim.rs file. High-level idea is that for a sum
Linear operations in sumcheck are essentially free, provided that they are vectorized enough. Reason for this is that we can reduce the problem to smaller sumcheck:
only over subset of variables.
The core of the argument is boolcheck - for arbitrary quadratic formula over base field we treat it as a formula depending on
namely, we find an evaluation set
hashcaster/src/protocols/utils.rs
Line 340 in 9e5a35b
Then, we restrict all coordinate polynomials on first
hashcaster/src/protocols/utils.rs
Line 449 in 9e5a35b
These are results from my relatively old 4 core i5:
... Generating initial randomness, it might take some time ...
... Preparing witness...
>>>> Witness gen took 114 ms
>>>> Evaluation of output took 40 ms
>> Total witness / claim generation time: 154 ms
>>>> Initialization (cloning) took: 46 ms
>>>> Table extension took: 307 ms
>>>> Rounds took: 483 ms
>>>> Verifier took: 1 ms
>> Boolcheck total time: 838 ms
... Entering multiopen phase ...
>> Multiopen took 74 ms
... Entering linear layer ...
>>>> Data prep (clone/restrict) took 70 ms
>>>> Main cycle took 1 ms
>> Linlayer took 71 ms
TOTAL TIME: 1139 ms
test examples::keccak::main_protocol::main_protocol ... ok
be warned that is not end to end test - it lacks both commitment (likely no more than 10-15% of time), and rotation argument to fit inputs with outputs. On other hand, there are known optimizations that I didn't do yet.
Marketplace of ideas is very inefficient, this is why I've spent a lot of my time to prove that this approach is feasible. But I'm not a cryptographic engineer, I'm a mathematician. I have very poor idea about low level optimizations. My hope is that this gets significant interest from the community and we will get performance to much higher level (while this is already respectable, I believe real engineers could do much, much better).