Cache values for faster Pedersen
arielgabizon opened this issue · 15 comments
For batches of Pedersen hashes, dramatic savings can be made by storing intermediate values.
Details:
The Sapling Pedersen hash by @ebfull and @daira splits the input into windows of 3 bits
and associates the j'th window with the group element g_j = [16^(j-1)] g, where g is the generator of the segment. (The input is split into segments of 3*63 bits; each segment is divided into 63 windows, or chunks; see section 5.4.1.7 here: https://github.com/zcash/zips/blob/master/protocol/protocol.pdf.)
The 3-bit input of each window is interpreted as a number c_j in {-4,-3,-2,-1,1,2,3,4},
and [c_j] g_j is added to the output.
In the current implementation
https://github.com/zcash-hackworks/sapling-crypto/blob/b70d6e66fc1de66f8609e22a2b13eb11bdeb69b3/src/pedersen_hash.rs#L37-L76
each window requires 8 group operations in the worst case and 4 in the best case
to compute [c_j] g_j together with the next base value g_{j+1}.
Precomputing all the values [c] g_j for c in {-4,-3,-2,-1,1,2,3,4} and j in {1..63} compresses these 4-8 operations into one addition, at the cost of storing 8 * 63 = 504 group elements of 32 bytes each.
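As a toy illustration of this table structure (not the real curve arithmetic: integers mod a prime stand in for Jubjub group elements, so "group addition" is addition mod P and "[k] g" is k * g mod P; all names here are my own), the precomputed rows reduce each window to one lookup plus one addition:

```rust
// Toy model of the precomputed-window idea. Only the table-lookup
// structure mirrors the real scheme; the arithmetic is a stand-in.
const P: u64 = 2_147_483_647; // toy group modulus
const G: u64 = 7; // toy segment generator
const WINDOWS: usize = 63;

/// table[j] holds [c] g_j for c in {-4,-3,-2,-1,1,2,3,4},
/// where g_j = [16^j] g (zero-indexed here).
fn build_table() -> Vec<[u64; 8]> {
    let mut table = Vec::with_capacity(WINDOWS);
    let mut g_j = G;
    for _ in 0..WINDOWS {
        let mut row = [0u64; 8];
        for (idx, c) in (-4i64..=4).filter(|&c| c != 0).enumerate() {
            let mag = c.unsigned_abs() * g_j % P;
            // negative digits map to the additive inverse, P - x
            row[idx] = if c < 0 { (P - mag) % P } else { mag };
        }
        table.push(row);
        g_j = g_j * 16 % P; // next window's base: g_{j+1} = [16] g_j
    }
    table
}

/// One lookup plus one "group addition" per 3-bit window digit.
fn hash_windows(table: &[[u64; 8]], digits: &[i64]) -> u64 {
    digits.iter().enumerate().fold(0, |acc, (j, &c)| {
        // digits {-4..-1} occupy slots 0..3, {1..4} occupy slots 4..7
        let idx = (if c < 0 { c + 4 } else { c + 3 }) as usize;
        (acc + table[j][idx]) % P
    })
}
```

With the table built once, hashing a segment is a linear scan of lookups, which is the saving the issue describes.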
Naturally, you could go further and precompute all possible output contributions of each block of d windows, for some parameter d < 63.
This would compress up to 8d group operations into one, at the cost of precomputing and storing, denoting by T the ceiling of 63/d,
8^d * T group elements, or 8^d * 32T bytes.
Recall that when hashing 64 bytes into 32 bytes we have 3 such segments, so in fact we need to cache
8^d * 3T group elements, or 8^d * 96T bytes.
For example, taking d = 4, we should be able to reduce Pedersen hash computation time by a factor of 32,
by caching
4096 * 16 * 96 = 6,291,456 bytes.
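The cache-size arithmetic above can be captured in a small helper (the function name is my own):

```rust
/// Cache bytes needed when grouping d of the 3-bit windows into a single
/// lookup: 8^d digit combinations per block, T = ceil(63/d) blocks per
/// segment, 3 segments per 64-byte input, 32 bytes per stored element.
fn cache_bytes(d: u32) -> u64 {
    let t = (63 + u64::from(d) - 1) / u64::from(d); // T = ceil(63/d)
    8u64.pow(d) * 32 * 3 * t
}
```

`cache_bytes(4)` reproduces the 6,291,456-byte figure quoted above, and `cache_bytes(1)` gives the no-grouping baseline of 48,384 bytes.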
Is the idea to cache that amount per hash or for all hashes?
for all hashes
Based on discussion with @arielgabizon, this is cache-size information:
| SPEED-UP | CACHE-ELEMENTS | SINGLE-CACHE-BYTES | CACHE-COUNT | TOTAL-CACHE-BYTES |
| -------- | -------------- | ------------------ | ----------- | ----------------- |
| 1 | 8 | 256 | 171 | 43,691 |
| 2 | 64 | 2,048 | 86 | 174,763 |
| 3 | 512 | 16,384 | 57 | 932,068 |
| 4 | 4,096 | 131,072 | 43 | 5,592,406 |
| 5 | 32,768 | 1,048,576 | 35 | 35,791,395 |
| 6 | 262,144 | 8,388,608 | 29 | 238,609,295 |
| 7 | 2,097,152 | 67,108,864 | 25 | 1,636,178,018 |
| 8 | 16,777,216 | 536,870,912 | 22 | 11,453,246,123 |
| 9 | 134,217,728 | 4,294,967,296 | 19 | 81,445,305,762 |
As you can see, the size of the cache quickly becomes unmanageable if we assume we will hold it all in memory at once (TOTAL-CACHE-BYTES). However, since merkle-tree generation requires many hashes in parallel, there is an opportunity to effectively multiply the cache size. For each row of the tree, compute a partial cache (or better, load a precomputed one from disk). Use this cache to hash the first N input bits for every 64-byte hash input. Then load the next cache and repeat, adding the incremental results to the previous ones. This would allow as much caching speedup as desired, at the cost of many passes over the data and loads from disk. The size of the cache might need to vary per row in order to justify these costs.
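A minimal sketch of that pass structure, again with integers mod a prime standing in for group elements (and the "cache for the current block of windows" modeled as a direct computation rather than a disk load; all names hypothetical):

```rust
// Toy model of the multi-pass strategy: hold only one block of windows
// "in cache" at a time, sweep over all leaves adding that block's
// contribution, then move to the next block.
const P: u64 = 2_147_483_647; // toy group modulus
const G: u64 = 7; // toy generator

/// Contribution of digit c in window j: [c] g_j with g_j = [16^j] g.
fn window_contribution(j: usize, c: i64) -> u64 {
    let mut g_j = G;
    for _ in 0..j {
        g_j = g_j * 16 % P;
    }
    let mag = c.unsigned_abs() * g_j % P;
    if c < 0 { (P - mag) % P } else { mag }
}

/// Process one digit vector per 64-byte input, in passes of `block`
/// windows, accumulating partial results per leaf between passes.
fn multi_pass_hash(digits_per_leaf: &[Vec<i64>], block: usize) -> Vec<u64> {
    let windows = digits_per_leaf[0].len();
    let mut acc = vec![0u64; digits_per_leaf.len()];
    for start in (0..windows).step_by(block) {
        // In the real scheme, load the precomputed cache for windows
        // [start, start + block) from disk here.
        for (leaf, digits) in digits_per_leaf.iter().enumerate() {
            for j in start..(start + block).min(windows) {
                acc[leaf] = (acc[leaf] + window_contribution(j, digits[j])) % P;
            }
        }
    }
    acc
}
```

Because group addition is associative and commutative, splitting the windows into any number of passes yields the same final values; only the memory/IO profile changes.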
In the table above, CACHE-COUNT is the number of passes over the data needed if only SINGLE-CACHE-BYTES are held in memory at once.
This continues to suggest that dedicated machines just for merkle-tree generation may be desirable -- so hardware can be configured to optimize for this usage pattern. Ideally, we would implement this in a way that makes size of the cache/window-bits entirely configurable. The first pass of the implementation probably should not include the multiple-cache solution, which adds complexity, but the possibility of needing that later should be considered when designing.
We should explore that option through calculations, though.
codename: "Spedersen"
I looked into this more, and I think what @arielgabizon and I discussed above (which already involves some differences from the original hope of this issue) needs to be modified further. This is because the second loop in the pedersen hashes is not using a 3-bit (size 8) window. Rather, it's an 8-bit (size 256) window. That changes the table to look like this:
| SPEED-UP | CACHE-ELEMENTS | SINGLE-CACHE-BYTES | CACHE-COUNT | TOTAL-CACHE-BYTES |
| -------- | -------------- | ------------------ | ----------- | ----------------- |
| 1 | 256 | 8,192 | 64 | 524,288 |
| 2 | 65,536 | 2,097,152 | 32 | 67,108,864 |
| 3 | 16,777,216 | 536,870,912 | 22 | 11,453,246,123 |
| 4 | 4,294,967,296 | 137,438,953,472 | 16 | 2,199,023,255,552 |
| 5 | 1,099,511,627,776 | 35,184,372,088,832 | 13 | 450,359,962,737,050 |
| 6 | 281,474,976,710,656 | 9,007,199,254,740,992 | 11 | 96,076,792,050,570,582 |
| 7 | 72,057,594,037,927,936 | 2,305,843,009,213,693,952 | 10 | 21,081,993,227,096,630,419 |
| 8 | 18,446,744,073,709,551,616 | 590,295,810,358,705,651,712 | 8 | 4,722,366,482,869,645,213,696 |
| 9 | 4,722,366,482,869,645,213,696 | 151,115,727,451,828,646,838,272 | 8 | 1,074,600,728,546,337,044,183,268 |
| 10 | 1,208,925,819,614,629,174,706,176 | 38,685,626,227,668,133,590,597,632 | 7 | 247,588,007,857,076,054,979,824,845 |
| 11 | 309,485,009,821,345,068,724,781,056 | 9,903,520,314,283,042,199,192,993,792 | 6 | 57,620,481,828,555,881,886,213,782,063 |
| 12 | 79,228,162,514,264,337,593,543,950,336 | 2,535,301,200,456,458,802,993,406,410,752 | 6 | 13,521,606,402,434,446,949,298,167,524,011 |
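The rows above follow a simple pattern for s grouped 8-bit windows; a helper that reproduces the element and byte counts (names my own; note the table's TOTAL-CACHE-BYTES column appears to use the fractional pass count 64/s rather than the ceiling, which is why it is not exactly SINGLE-CACHE-BYTES times CACHE-COUNT for rows like s=3):

```rust
/// One row of the 8-bit-window table: grouping s windows gives 256^s
/// cache entries of 32 bytes each, and ceil(64/s) passes over the 64
/// windows of a 512-bit (64-byte) input.
/// Returns (CACHE-ELEMENTS, SINGLE-CACHE-BYTES, CACHE-COUNT).
fn row(s: u32) -> (u128, u128, u128) {
    let elements = 256u128.pow(s); // 256^s digit combinations
    let single_cache_bytes = elements * 32;
    let cache_count = (64 + u128::from(s) - 1) / u128::from(s); // ceil
    (elements, single_cache_bytes, cache_count)
}
```

u128 is needed because the element counts overflow u64 beyond s = 8.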
This brings the cache size up even further (with all the same logic applying), and therefore further reduces the expected overall benefit. I did validate that the best case of making the changes is likely to do what we expect, by doubling the window size (from 8 to 16) and observing a 2x speedup in the pedersen hash benchmark in sapling_crypto.
This also adds complexity, though, because even processing two windows at a time will require more bits than are produced by the 'first loop'. That means we'll either need a lot of complexity to retain the same algorithm, or the actual result of the pedersen hashes will change because of a new strategy for how many bits are assigned to each generator. We would need to discuss whether such an approach could even be viable.
Based on the above, I think we should consider this now to be 2-3x at best, and with a lot of accompanying effort to rework the implementation.
I've had time to think about this a bit more and also discussed some planning issues with @sanchopansa. My tentative conclusions are:
- As a straightforward memoization grouping multiple windows and their corresponding point additions into a single lookup, we can definitely get the expected speedup.
- Because of the actual cache sizes involved, it may be that a factor of 2 or 3 is the best we can expect, though.
- We should not mess around with changes that would alter the hashing result, as this would invalidate the analysis of collision-resistance on which the current algorithm is based.
- However, because of how the nested loops are structured, the 'straightforward engineering' solution to memoizing multiple operations would actually require a fair amount of work to restructure the code. (Not impossible, but not entirely trivial either.)
- It would be much easier to do if the implementation had been designed to allow for this from the start.
- We plan to port this algorithm to Zexe anyway.
- Since it needs to be implemented from scratch and cannot be simply copied over anyway, this is an opportunity to design for this optimization from the beginning.
- Therefore, the tentative plan is to pause work on implementing this in Bellman and instead plan for the Zexe port to add a parameter for how many windows should be grouped into a single lookup. A default of 1 would produce Bellman's current behavior. This setting should not affect the resulting hash values.
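To make the last point concrete, here is a toy sketch of what such a grouping parameter could look like (integers mod a prime in place of group elements; all names hypothetical), with the property that the digest is independent of the grouping:

```rust
// Toy sketch of a configurable windows-per-lookup parameter `d`.
// Whatever d is chosen, the result is the same sum of per-window
// contributions, so the "hash" value is unchanged by the setting.
const P: u64 = 2_147_483_647; // toy group modulus
const G: u64 = 7; // toy generator

/// Contribution of digit c in window j: [c] g_j with g_j = [16^j] g.
fn contribution(j: usize, c: i64) -> u64 {
    let mut g_j = G;
    for _ in 0..j {
        g_j = g_j * 16 % P;
    }
    let mag = c.unsigned_abs() * g_j % P;
    if c < 0 { (P - mag) % P } else { mag }
}

/// d = 1 models the current per-window behavior; d > 1 models one
/// lookup covering a block of d windows (the block's combined
/// contribution is computed on the fly here instead of fetched from
/// a precomputed table).
fn hash_grouped(digits: &[i64], d: usize) -> u64 {
    digits.chunks(d).enumerate().fold(0, |acc, (b, block)| {
        let block_sum = block.iter().enumerate().fold(0u64, |s, (k, &c)| {
            (s + contribution(b * d + k, c)) % P
        });
        (acc + block_sum) % P
    })
}
```

Designing the port around a parameter like this from the start is exactly what makes the later cache optimization a pure lookup-table swap rather than a restructuring.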
If someone has more information or a different analysis, we can revisit. Otherwise, I'm prepared to consider the initial investigation complete and tentatively adopt the plan outlined above.
Please note that the original issue is somewhat out of date because of what we have learned from closer reading of the code.
@sanchopansa @jake @arielgabizon @nicola
One more update: it seems that just adjusting window_size in the code is in fact a way of configuring this option already. (I have only checked this by running the tests, which all pass after changing 8 to 16 in the implementation of pedersen_hash_exp_window_size in jubjub/mod.rs.) I need to read the code more closely to convince myself why this works, but it's unsurprising that the elaborate caching mechanism already implemented does in fact work that way.
So… I now think the answer is that we should add a mechanism to make this configurable and/or choose the default value we want. We could still consider the 'many passes with smaller in-memory cache' strategy, but the incremental gains might well not warrant it.
67MB for 2x seems like it may be a good tradeoff in our setting; 11GiB for 3x, not so much. 500MiB for a best-case (if overhead proves cheap) 3x plus a lot of implementation complexity… maybe.
It seems we can also set this to any bit size, so the optimal tradeoff may be at some speedup which is not an integer multiple. This might be worth benchmarking to see real performance numbers in both speed of creating large merkle trees and actual memory consumed.
It seems plausible that the best answer will be empirical discovery of a sweet spot which gives us best bang for the buck. Whatever we learn there will help us model this for projections. We will need to be able to do so in order to choose values for the merkle-hybrid thresholds by layer.
This might be something @jake can work on after all. @dignifiedquire would you help think about how best to conduct and report that investigation?
Although it seems clear in retrospect that we cannot, I thought maybe we could get some benefit from caching in the first loop (which calculates the scalar). Unfortunately, the bookkeeping required to manage indexing into the cache is effectively equivalent to just performing the calculations, and the edge cases involved in dealing with partial 'chunks' of windows add a lot of complexity.
My conclusion is that we can't gain any benefit this way (unfortunately), and I don't think we should put any more energy into trying. I've pushed a branch implementing this, for the sake of completeness and in case someone wants to have a look, though. (https://github.com/filecoin-project/sapling-crypto/tree/feat/pedersen-cache)
To be clear, we can still manually set the window size for the second loop and get a speedup at the cost of larger cache. @arielgabizon has also noted that we can reduce the required cache by calculating rather than caching the negation, since this is relatively cheap. That's a strategy we could explore if we find we need to push the limit and are confident (after calculating) that the space savings warrant the complexity of a new implementation.
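A toy sketch of that halved table (again integers mod a prime, where negation is P - x; for Jubjub in twisted Edwards form, point negation is similarly cheap, just negating one coordinate; names my own):

```rust
const P: u64 = 2_147_483_647; // toy group modulus

/// Build the halved row for a base g_j: only [1]g_j .. [4]g_j.
/// A full row would store 8 entries, for c in {-4..-1, 1..4}.
fn half_row(g_j: u64) -> [u64; 4] {
    [g_j % P, 2 * g_j % P, 3 * g_j % P, 4 * g_j % P]
}

/// Look up [c] g_j from the halved row, deriving negative digits by
/// one cheap negation (P - x in this toy additive group).
fn lookup(positives: &[u64; 4], c: i64) -> u64 {
    let entry = positives[(c.unsigned_abs() - 1) as usize];
    if c < 0 { (P - entry) % P } else { entry }
}
```

The storage cost per window drops from 8 cached elements to 4, at the price of one negation per negative digit.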
A small credit: not caching negation values was pointed out to me by @ValarDragon.
One more thought: since we will most likely end up porting this pedersen hashing algorithm to ZEXE, it may make the most sense to include the 'no-cache negation' optimization described above in the initial implementation. @sanchopansa, @stefandeml.
@nicola @arielgabizon @DrPeterVanNostrand @dignifiedquire
Here are benchmarks of pedersen hashing based on changing pedersen_hash_exp_window_size on line 184 of sapling-crypto/src/jubjub/mod.rs. The default is 8, and I tested with 16, 20, and 24.
Results Summary
| WINDOW-SIZE | TIME (ns/iter) | SPEEDUP | MEMORY (KiB) | MEMORY-SCALE | MEMORY-DELTA (KiB) |
| ----------- | -------------- | ------- | ------------ | ------------ | ------------------ |
| 8 | 26343 | 1.0 | 607416 | 1.0 | 0 |
| 16 | 15243 | 1.728203 | 658776 | 1.0845549 | 51360 |
| 20 | 14018 | 1.8792267 | 8522924 | 14.031445 | 7915508 |
| 24 | 12039 | 2.1881385 | 64251576 | 105.778534 | 63644160 |
Bear in mind that the memory increase factor is for total memory consumption, which at smaller cache sizes is not dominated by the spedersen change. As you can see, we can get a 1.73x speedup for only ~50MiB of extra memory, an obvious benefit we should probably make the default. To move that to 1.88x requires 8GiB, and getting above 2x (to 2.19x) requires 64GiB!
The proposed optimization will shift those numbers somewhat, and we can calculate the effect. However, the general pattern will remain similar: we should expect a modest increase toward (but probably less than) 2x at reasonable memory costs.
If you compare with the projected table above, you can see that realized performance is a little worse than predicted, but the general pattern holds.
I stick by what we discussed earlier (sync), which is that we should think of this as being about a 2x max speedup, and that the exact amount will depend on memory tradeoffs which have to be evaluated against other time-space tradeoffs also being made.
Bench output
➜ sapling-crypto git:(master) ✗ /usr/bin/time cargo bench pedersen # window size: 8
Compiling fil-sapling-crypto v0.1.1-alpha.0 (/home/porcuquine/dev/sapling-crypto)
Finished release [optimized] target(s) in 9.86s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 2 tests
test circuit::pedersen_hash::test::test_pedersen_hash ... ignored
test circuit::pedersen_hash::test::test_pedersen_hash_constraints ... ignored
test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 85 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash ... bench: 26,343 ns/iter (+/- 26,609)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
36.04user 0.58system 0:10.17elapsed 360%CPU (0avgtext+0avgdata 607416maxresident)k
0inputs+30232outputs (0major+221617minor)pagefaults 0swaps
➜ sapling-crypto git:(master) /usr/bin/time cargo bench pedersen # window size: 8
Finished release [optimized] target(s) in 0.04s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 2 tests
test circuit::pedersen_hash::test::test_pedersen_hash ... ignored
test circuit::pedersen_hash::test::test_pedersen_hash_constraints ... ignored
test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 85 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash ... bench: 28,822 ns/iter (+/- 428)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
0.29user 0.01system 0:00.31elapsed 100%CPU (0avgtext+0avgdata 32576maxresident)k
0inputs+0outputs (0major+5647minor)pagefaults 0swaps
➜ sapling-crypto git:(master) /usr/bin/time cargo bench pedersen # window size: 16
Compiling fil-sapling-crypto v0.1.1-alpha.0 (/home/porcuquine/dev/sapling-crypto)
Finished release [optimized] target(s) in 9.99s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 2 tests
test circuit::pedersen_hash::test::test_pedersen_hash ... ignored
test circuit::pedersen_hash::test::test_pedersen_hash_constraints ... ignored
test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 85 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash ... bench: 15,243 ns/iter (+/- 342)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
38.05user 0.68system 0:12.09elapsed 320%CPU (0avgtext+0avgdata 658776maxresident)k
0inputs+30256outputs (0major+385878minor)pagefaults 0swaps
➜ sapling-crypto git:(master) ✗ /usr/bin/time cargo bench pedersen # window size: 16
Finished release [optimized] target(s) in 0.05s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 2 tests
test circuit::pedersen_hash::test::test_pedersen_hash ... ignored
test circuit::pedersen_hash::test::test_pedersen_hash_constraints ... ignored
test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 85 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash ... bench: 16,242 ns/iter (+/- 166)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
1.75user 0.16system 0:01.92elapsed 100%CPU (0avgtext+0avgdata 658692maxresident)k
0inputs+0outputs (0major+168305minor)pagefaults 0swaps
➜ sapling-crypto git:(master) ✗ /usr/bin/time cargo bench pedersen # window size: 24
Compiling fil-sapling-crypto v0.1.1-alpha.0 (/home/porcuquine/dev/sapling-crypto)
Finished release [optimized] target(s) in 9.90s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 2 tests
test circuit::pedersen_hash::test::test_pedersen_hash ... ignored
test circuit::pedersen_hash::test::test_pedersen_hash_constraints ... ignored
test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 85 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash ... bench: 12,039 ns/iter (+/- 530)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
322.87user 28.39system 5:34.18elapsed 105%CPU (0avgtext+0avgdata 64251576maxresident)k
37992inputs+30256outputs (636major+29056437minor)pagefaults 0swaps
➜ sapling-crypto git:(master) ✗ /usr/bin/time cargo bench pedersen # window size: 20
Compiling fil-sapling-crypto v0.1.1-alpha.0 (/home/porcuquine/dev/sapling-crypto)
Finished release [optimized] target(s) in 10.10s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 2 tests
test circuit::pedersen_hash::test::test_pedersen_hash ... ignored
test circuit::pedersen_hash::test::test_pedersen_hash_constraints ... ignored
test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 85 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash ... bench: 14,018 ns/iter (+/- 36)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
55.82user 2.27system 0:31.79elapsed 182%CPU (0avgtext+0avgdata 8522924maxresident)k
47088inputs+30256outputs (252major+2350804minor)pagefaults 0swaps
Circuit tests are timing out with window size 16: filecoin-project/sapling-crypto#5
I'm not yet sure if this is a real problem, but we should at least add some circuit benchmarks. We may need to independently set the value when running in and out of circuits.
Updated benchmarks, probing window sizes of 17, 18, 19.
17 looks like another sweet spot, actually the best performance before it begins decreasing, though at 600+MiB of overhead.
We can probably settle on either 16 or 17 and call that good for now.
| WINDOW-SIZE | TIME (ns/iter) | SPEEDUP | MEMORY (KiB) | MEMORY-SCALE | MEMORY-DELTA (KiB) |
| ----------- | -------------- | ------- | ------------ | ------------ | ------------------ |
| 8 | 26343 | 1.0 | 607416 | 1.0 | 0 |
| 16 | 15243 | 1.728203 | 658776 | 1.0845549 | 51360 |
| 17 | 14534 | 1.8125086 | 1232136 | 2.028488 | 624720 |
| 18 | 14727 | 1.7887553 | 2460928 | 4.0514703 | 1853512 |
| 19 | 14738 | 1.7874203 | 4590888 | 7.5580626 | 3983472 |
| 20 | 14018 | 1.8792267 | 8522924 | 14.031445 | 7915508 |
| 24 | 12039 | 2.1881385 | 64251576 | 105.778534 | 63644160 |
Raw results follow; for completeness, note that the timing with window size 16 during this run was a little slower than previously.
➜ sapling-crypto git:(feat/spedersen-1) ✗ /usr/bin/time cargo bench pedersen_hash_16
Finished release [optimized] target(s) in 0.05s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 87 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash_16 ... bench: 16,291 ns/iter (+/- 327)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out
2.01user 0.14system 0:02.16elapsed 100%CPU (0avgtext+0avgdata 658668maxresident)k
0inputs+0outputs (0major+168314minor)pagefaults 0swaps
➜ sapling-crypto git:(feat/spedersen-1) ✗ /usr/bin/time cargo bench pedersen_hash_17
Finished release [optimized] target(s) in 0.05s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 87 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash_17 ... bench: 14,534 ns/iter (+/- 73)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out
3.09user 0.29system 0:03.38elapsed 100%CPU (0avgtext+0avgdata 1232136maxresident)k
0inputs+0outputs (0major+311665minor)pagefaults 0swaps
➜ sapling-crypto git:(feat/spedersen-1) ✗ /usr/bin/time cargo bench pedersen_hash_18
Finished release [optimized] target(s) in 0.04s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 87 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash_18 ... bench: 14,727 ns/iter (+/- 510)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out
5.93user 0.56system 0:06.49elapsed 100%CPU (0avgtext+0avgdata 2460928maxresident)k
0inputs+0outputs (0major+618872minor)pagefaults 0swaps
➜ sapling-crypto git:(feat/spedersen-1) ✗ /usr/bin/time cargo bench pedersen_hash_19
Finished release [optimized] target(s) in 0.04s
Running target/release/deps/fil_sapling_crypto-a4e1769f0be8e66c
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 87 filtered out
Running target/release/deps/pedersen_hash-7308957cda956ed4
running 1 test
test bench_pedersen_hash_19 ... bench: 14,738 ns/iter (+/- 20)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 4 filtered out
10.89user 0.96system 0:11.86elapsed 100%CPU (0avgtext+0avgdata 4590888maxresident)k
0inputs+0outputs (0major+1151316minor)pagefaults 0swaps
We have merged configurable window sizes and set the new default to 16.