Introduce data-parallelism with Rayon

Question

gbenattar opened this issue 7 years ago · 15 comments

The purpose of this task is to introduce data-parallelism with Rayon, starting with the zero-knowledge proof (refactored to use iterators).

Comparison benches will be posted.

Answer 1 · 2018-06-06T08:15:29.000Z

I made a quick attempt but ran into issues that looked like they had something to do with passing around references to Mpz objects, even though rust-gmp seems to implement the needed traits.

Answer 2 · 2018-06-06T16:07:37.000Z

I ran into the same issue; here are the additions:

Cargo.toml

rust-gmp = { version="0.4", optional=true }
# [...]
rayon = "1.0.1"

src/lib.rs

#[cfg(feature="proofs")]
extern crate rayon;

https://github.com/mortendahl/rust-paillier/blob/dev/src/proof/mod.rs

use rayon::prelude::*; // brings par_iter() into scope

let x: Vec<_> = y.par_iter()
             .map(|yi| BigInt::modpow(yi, &ek.n, &ek.n))
             .collect();

cargo build

error[E0599]: no method named `par_iter` found for type `std::vec::Vec<arithimpl::gmpimpl::gmp::mpz::Mpz>` in the current scope
  --> src/proof/mod.rs:95:27
   |
95 |         let x: Vec<_> = y.par_iter()
   |                           ^^^^^^^^
   |
   = note: the method `par_iter` exists but the following trait bounds were not satisfied:
           `std::vec::Vec<arithimpl::gmpimpl::gmp::mpz::Mpz> : rayon::iter::IntoParallelRefIterator`
           `[arithimpl::gmpimpl::gmp::mpz::Mpz] : rayon::iter::IntoParallelRefIterator`

error: aborting due to previous error

FYI: https://github.com/mortendahl/rust-paillier/blob/dev/src/arithimpl/gmpimpl.rs.
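The `par_iter()` snippet above only works because the proof code is already an iterator pipeline, which is what the refactoring mentioned in the issue is for. Below is a minimal std-only sketch of that refactor. It assumes small `u64` values in place of `Mpz`, and `modpow`, `compute_x_loop` and `compute_x_iter` are hypothetical names. Both functions return the same vector, but the iterator form is the one that becomes parallel by swapping `iter()` for Rayon's `par_iter()`.

```rust
/// Hypothetical stand-in for `BigInt::modpow(base, exp, modulus)`:
/// square-and-multiply on u64, with u128 intermediates to avoid overflow.
fn modpow(base: u64, mut exp: u64, modulus: u64) -> u64 {
    let m = modulus as u128;
    let mut b = base as u128 % m;
    let mut acc: u128 = 1 % m;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * b % m;
        }
        b = b * b % m;
        exp >>= 1;
    }
    acc as u64
}

/// Loop form: accumulates results into a pre-allocated vector.
fn compute_x_loop(y: &[u64], n: u64) -> Vec<u64> {
    let mut x = Vec::with_capacity(y.len());
    for &yi in y {
        x.push(modpow(yi, n, n));
    }
    x
}

/// Iterator form, matching the shape of the snippet above. Each element is
/// computed independently, so with `use rayon::prelude::*;` in scope the
/// `iter()` call could become `par_iter()` without other changes.
fn compute_x_iter(y: &[u64], n: u64) -> Vec<u64> {
    y.iter().map(|&yi| modpow(yi, n, n)).collect()
}

fn main() {
    let n = 15; // toy modulus, far too small for a real Paillier key
    let y = [2, 4, 7];
    assert_eq!(compute_x_loop(&y, n), compute_x_iter(&y, n));
    println!("{:?}", compute_x_iter(&y, n));
}
```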

Answer 3 · 2018-06-06T16:33:12.000Z

Updating to rust-gmp 0.5.0 fixes the issue, since that version introduces a "Sync" implementation for Mpz.
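The error above comes from a trait bound. Rayon implements `IntoParallelIterator` for `&Vec<T>` and `&[T]` only when `T: Sync`, because worker threads share `&T`. The sketch below shows the same constraint using only the standard library. `parallel_map` is a hypothetical helper built on `std::thread::scope` (Rust 1.63+), not Rayon's API. `Cell<u64>` plays the role of a non-`Sync` element type, like `Mpz` before rust-gmp 0.5.0.

```rust
use std::thread;

/// Hypothetical helper (not Rayon's API): maps `f` over `items` on scoped threads.
/// Each worker borrows a sub-slice `&[T]`, and sending a `&[T]` to another thread
/// requires `T: Sync`, the same bound Rayon puts on the element type for `par_iter()`.
fn parallel_map<T, R, F>(items: &[T], threads: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let threads = threads.max(1);
    let chunk_size = ((items.len() + threads - 1) / threads).max(1);
    let f = &f;
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(f).collect::<Vec<R>>()))
            .collect();
        // Join in spawn order so the output keeps the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

fn main() {
    let y: Vec<u64> = (1..=8).collect();
    // `u64` is `Sync`, so this compiles.
    let squares = parallel_map(&y, 4, |v| v * v);
    println!("{:?}", squares);

    // `Cell<u64>` is not `Sync`, so uncommenting these lines is rejected at compile
    // time with an unsatisfied `Sync` bound, much like the `Mpz` error above.
    // let cells: Vec<std::cell::Cell<u64>> = y.iter().map(|&v| std::cell::Cell::new(v)).collect();
    // let _ = parallel_map(&cells, 4, |c| c.get());
}
```

A wrapper around a raw C pointer is not `Sync` by default either, so a library like rust-gmp has to opt in explicitly. Presumably that is the change the comment above refers to.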

Answer 4 · 2018-06-06T22:23:25.000Z

cargo outdated is a great command btw in case you didn't know it already ;)

Answer 5 · 2018-06-07T03:06:03.000Z

Initial results without Rayon:

test self::bench_zk_proof_challenge_1024        ... bench: 284,917,530 ns/iter (+/- 58,277,547)
test self::bench_zk_proof_prove_1024            ... bench: 301,056,360 ns/iter (+/- 32,855,725)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 298,151,884 ns/iter (+/- 38,319,745)

Answer 6 · 2018-06-07T08:24:52.000Z

I suppose that's okay (~0.3 s)? Did you try benchmarking without Rayon as well?

Answer 7 · 2018-06-07T08:27:11.000Z

It is WIP; I will post the results here.

Answer 8 · 2018-06-07T08:30:01.000Z

ah sorry, I misread. looking forward!

Answer 9 · 2018-06-08T13:06:53.000Z

Initial results with Rayon:

test self::bench_zk_proof_challenge_1024        ... bench: 133,593,920 ns/iter (+/- 3,243,637)
test self::bench_zk_proof_prove_1024            ... bench: 159,987,877 ns/iter (+/- 4,681,477)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 159,776,794 ns/iter (+/- 4,358,748)
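Here is a small sketch for comparing the two runs. It parses libtest `bench:` lines and divides the sequential time by the Rayon time. The line format is taken from the output in this thread, and the `parse_bench_line` and `speedups` helpers are hypothetical. With the numbers posted above, this gives roughly 2.1× for the challenge and about 1.9× for prove and prove-and-verify. The `+/-` spread is also much smaller in the Rayon run.

```rust
/// Parses a libtest line such as
/// `test self::bench_x ... bench: 284,917,530 ns/iter (+/- 58,277,547)`
/// into (name, ns_per_iter, deviation). Returns `None` for any other line.
fn parse_bench_line(line: &str) -> Option<(String, u64, u64)> {
    let rest = line.trim().strip_prefix("test ")?;
    let (name, rest) = rest.split_once(" ... bench:")?;
    let (ns, rest) = rest.trim().split_once(" ns/iter")?;
    let dev = rest.trim().strip_prefix("(+/-")?.strip_suffix(')')?;
    // Strip thousands separators before parsing.
    let num = |s: &str| s.trim().replace(',', "").parse::<u64>().ok();
    Some((name.trim().to_string(), num(ns)?, num(dev)?))
}

/// Returns (bench name, before/after ratio) for each bench present in both runs.
fn speedups(before: &str, after: &str) -> Vec<(String, f64)> {
    let after: Vec<(String, u64, u64)> = after.lines().filter_map(parse_bench_line).collect();
    before
        .lines()
        .filter_map(parse_bench_line)
        .filter_map(|(name, ns_before, _)| {
            let (_, ns_after, _) = after.iter().find(|(n, _, _)| *n == name)?;
            Some((name, ns_before as f64 / *ns_after as f64))
        })
        .collect()
}

/// Numbers copied from the "without Rayon" comment above.
const WITHOUT_RAYON: &str = "\
test self::bench_zk_proof_challenge_1024        ... bench: 284,917,530 ns/iter (+/- 58,277,547)
test self::bench_zk_proof_prove_1024            ... bench: 301,056,360 ns/iter (+/- 32,855,725)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 298,151,884 ns/iter (+/- 38,319,745)";

/// Numbers copied from the "with Rayon" comment above.
const WITH_RAYON: &str = "\
test self::bench_zk_proof_challenge_1024        ... bench: 133,593,920 ns/iter (+/- 3,243,637)
test self::bench_zk_proof_prove_1024            ... bench: 159,987,877 ns/iter (+/- 4,681,477)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 159,776,794 ns/iter (+/- 4,358,748)";

fn main() {
    for (name, ratio) in speedups(WITHOUT_RAYON, WITH_RAYON) {
        println!("{name}: {ratio:.2}x faster with Rayon");
    }
}
```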

Answer 10 · 2018-06-08T13:40:20.000Z

nice! 👍 does bencher play nicely with threads, or is there some uncertainty in the numbers due to that?

Answer 11 · 2018-06-10T19:08:46.000Z

No, it is pretty stable. Here is a run that also includes a 2048-bit key size:

running 5 tests
test self::bench_zk_proof_challenge_1024        ... bench: 163,783,130 ns/iter (+/- 57,689,537)
test self::bench_zk_proof_prove_1024            ... bench: 160,326,807 ns/iter (+/- 13,572,992)
test self::bench_zk_proof_prove_all_1024        ... bench: 291,581,102 ns/iter (+/- 6,282,444)
test self::bench_zk_proof_prove_all_2048        ... bench: 2,078,006,591 ns/iter (+/- 23,361,374)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 157,755,876 ns/iter (+/- 3,513,484)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured

Answer 12 · 2018-06-10T22:54:46.000Z

@gbenattar, out of the 5 tests, only number 4 was for 2048 bits?
Could you please add 2048-bit benchmarks without Rayon?

Answer 13 · 2018-06-11T07:45:27.000Z

Hi @gbenattar, sorry, that wasn't clear; what I was wondering was whether the multi-threaded nature of bencher interfered with Rayon's performance. From a quick skim, bencher might run the tests in parallel (using one thread per core), meaning Rayon might be penalised. Would you mind trying the benches without concurrency, just for fun? :)
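One way to rule out interference from the bench harness is to time the workload outside it entirely. The sketch below does this with only the standard library. It assumes a toy `u64` modular-exponentiation workload, and `modpow`, `run_sequential`, `run_parallel` and the input size are hypothetical. It uses `std::thread::scope` with one thread per available core as a rough stand-in for Rayon's default pool, so it approximates `par_iter()` rather than reproducing it. It takes a single wall-clock measurement with no warm-up, so treat the timings as indicative only.

```rust
use std::thread;
use std::time::Instant;

/// Toy workload: base^exp mod m by square-and-multiply (u128 avoids overflow).
fn modpow(base: u64, mut exp: u64, m: u64) -> u64 {
    let m = m as u128;
    let mut b = base as u128 % m;
    let mut acc: u128 = 1 % m;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * b % m;
        }
        b = b * b % m;
        exp >>= 1;
    }
    acc as u64
}

/// Sequential baseline, equivalent to the `iter()` version.
fn run_sequential(ys: &[u64], n: u64) -> Vec<u64> {
    ys.iter().map(|&y| modpow(y, n, n)).collect()
}

/// Splits `ys` into one chunk per available core and processes chunks on scoped threads.
fn run_parallel(ys: &[u64], n: u64) -> Vec<u64> {
    let cores = thread::available_parallelism().map(|c| c.get()).unwrap_or(1);
    let chunk = ((ys.len() + cores - 1) / cores).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = ys
            .chunks(chunk)
            .map(|c| s.spawn(move || run_sequential(c, n)))
            .collect();
        // Joining in spawn order keeps results in input order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    // Hypothetical sizes; a large odd toy modulus so each modpow does real work.
    let n: u64 = 0xFFFF_FFFF_FFFF_FFC5;
    let ys: Vec<u64> = (1..=20_000u64).collect();

    let t = Instant::now();
    let seq = run_sequential(&ys, n);
    let seq_time = t.elapsed();

    let t = Instant::now();
    let par = run_parallel(&ys, n);
    let par_time = t.elapsed();

    assert_eq!(seq, par, "parallel result must match sequential");
    println!("sequential: {:?}, parallel: {:?}", seq_time, par_time);
}
```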

Answer 14 · 2018-06-11T18:42:52.000Z

So here is a quick test for a single bench:

When reverting the code to use iter():

test self::bench_zk_proof_challenge_1024        ... bench: 289,669,699 ns/iter (+/- 45,791,433)

When using par_iter() with RUST_TEST_THREADS=1 cargo bench:

test self::bench_zk_proof_challenge_1024        ... bench: 142,414,039 ns/iter (+/- 18,410,610)

When using par_iter() with plain cargo bench (after running unset RUST_TEST_THREADS):

test self::bench_zk_proof_challenge_1024        ... bench: 134,116,663 ns/iter (+/- 22,447,834)

Not sure why the last two give similar results; do you have any idea?

Side note: I am running this on the following machine:

sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz

Answer 15 · 2018-06-17T18:22:34.000Z

Merged: #17.