mortendahl/rust-paillier

Introduce data-parallelism with Rayon

gbenattar opened this issue · 15 comments

The purpose of this task is to introduce data parallelism with Rayon, starting with the zero-knowledge proof (refactored to use iterators).

Comparison benches will be posted.

I made a quick attempt but ran into issues that looked like they had something to do with passing around references to Mpz objects, even though rust-gmp seems to implement the needed traits.

I ran into the same issue. Here are the additions:

  • Cargo.toml
rust-gmp = { version="0.4", optional=true }
# [...]
rayon = "1.0.1"
  • src/lib.rs
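#[cfg(feature="proofs")]
extern crate rayon;
// [...]
// par_iter() comes from Rayon's prelude traits, which must be in scope at the call site.
use rayon::prelude::*;
let x: Vec<_> = y.par_iter()
             .map(|yi| BigInt::modpow(yi, &ek.n, &ek.n))
             .collect();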
  • cargo build
error[E0599]: no method named `par_iter` found for type `std::vec::Vec<arithimpl::gmpimpl::gmp::mpz::Mpz>` in the current scope
  --> src/proof/mod.rs:95:27
   |
95 |         let x: Vec<_> = y.par_iter()
   |                           ^^^^^^^^
   |
   = note: the method `par_iter` exists but the following trait bounds were not satisfied:
           `std::vec::Vec<arithimpl::gmpimpl::gmp::mpz::Mpz> : rayon::iter::IntoParallelRefIterator`
           `[arithimpl::gmpimpl::gmp::mpz::Mpz] : rayon::iter::IntoParallelRefIterator`

error: aborting due to previous error

FYI: https://github.com/mortendahl/rust-paillier/blob/dev/src/arithimpl/gmpimpl.rs.

Updating to rust-gmp 0.5.0 fixes the issue: that release introduces "Sync" for Mpz, which par_iter() needs in order to share references to the elements across threads.
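
To illustrate why the "Sync" bound matters here, below is a minimal sketch that uses only the standard library instead of Rayon. It is not the code from this repository: parallel_map is a hypothetical helper, u64 stands in for Mpz, and modpow is a toy replacement for BigInt::modpow. The point is only that every worker thread borrows elements of the same slice, which the compiler accepts only when the element type is Sync.

```rust
use std::thread;

/// Toy modular exponentiation on u64 (stand-in for BigInt::modpow on Mpz).
fn modpow(base: u64, exp: u64, modulus: u64) -> u64 {
    if modulus == 1 {
        return 0;
    }
    let m = modulus as u128;
    let mut result: u128 = 1;
    let mut b = base as u128 % m;
    let mut e = exp;
    while e > 0 {
        if e & 1 == 1 {
            result = result * b % m;
        }
        b = b * b % m;
        e >>= 1;
    }
    result as u64
}

/// Hypothetical helper: maps `f` over `items` using up to `workers` scoped threads.
/// `T: Sync` is required because all workers hold shared references into `items`.
fn parallel_map<T, R, F>(items: &[T], workers: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    if items.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    let chunk_len = (items.len() + workers - 1) / workers;
    thread::scope(|scope| {
        let f = &f;
        // Spawn all workers first, then join them in order to preserve element order.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(f).collect::<Vec<R>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

fn main() {
    let n: u64 = 3233; // toy modulus, far too small for real Paillier keys
    let y: Vec<u64> = (1..=8).collect();
    let x = parallel_map(&y, 4, |yi| modpow(*yi, n, n));
    let sequential: Vec<u64> = y.iter().map(|yi| modpow(*yi, n, n)).collect();
    assert_eq!(x, sequential);
    println!("{:?}", x);
}
```

Removing the T: Sync bound makes the scope.spawn call fail to compile, which is the same kind of constraint that kept par_iter() from being available on Vec<Mpz> before the upgrade.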

cargo outdated is a great command btw in case you didn't know it already ;)

Initial results without Rayon:

test self::bench_zk_proof_challenge_1024        ... bench: 284,917,530 ns/iter (+/- 58,277,547)
test self::bench_zk_proof_prove_1024            ... bench: 301,056,360 ns/iter (+/- 32,855,725)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 298,151,884 ns/iter (+/- 38,319,745)

I suppose that's okay (.3s)..? Did you try to bench without rayon as well?

It is WIP, I will post results here.

ah sorry, I misread. looking forward!

Initial results with Rayon:

test self::bench_zk_proof_challenge_1024        ... bench: 133,593,920 ns/iter (+/- 3,243,637)
test self::bench_zk_proof_prove_1024            ... bench: 159,987,877 ns/iter (+/- 4,681,477)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 159,776,794 ns/iter (+/- 4,358,748)

nice! 👍 does bencher play nicely with threads, or is there some uncertainty in the numbers due to that?

No, it is pretty much stable. Here is a run that also includes a 2048-bit key size:

running 5 tests
test self::bench_zk_proof_challenge_1024        ... bench: 163,783,130 ns/iter (+/- 57,689,537)
test self::bench_zk_proof_prove_1024            ... bench: 160,326,807 ns/iter (+/- 13,572,992)
test self::bench_zk_proof_prove_all_1024        ... bench: 291,581,102 ns/iter (+/- 6,282,444)
test self::bench_zk_proof_prove_all_2048        ... bench: 2,078,006,591 ns/iter (+/- 23,361,374)
test self::bench_zk_proof_prove_and_verify_1024 ... bench: 157,755,876 ns/iter (+/- 3,513,484)

test result: ok. 0 passed; 0 failed; 0 ignored; 5 measured

@gbenattar out of the 5 tests, only number 4 was for 2048 bits?
Can you please add benchmarks for 2048 bits without Rayon?

Hi @gbenattar, sorry it wasn't clear; what I was wondering was whether the multi-threading nature of bencher interfered with rayon performance. From a quick skim bencher might run the tests in parallel (using one thread per core), meaning rayon might be penalised. Would you mind trying the benches without concurrency just for fun? :)

So here is a quick test for a single bench:

  • When reverting the code to use iter():
test self::bench_zk_proof_challenge_1024        ... bench: 289,669,699 ns/iter (+/- 45,791,433)
  • When using par_iter() with RUST_TEST_THREADS=1 cargo bench:
test self::bench_zk_proof_challenge_1024        ... bench: 142,414,039 ns/iter (+/- 18,410,610)
  • When using par_iter() with plain cargo bench (after running unset RUST_TEST_THREADS):
test self::bench_zk_proof_challenge_1024        ... bench: 134,116,663 ns/iter (+/- 22,447,834)

Not sure why the last two give similar results. Do you have any idea?

Side note: I am running this on the following machine:

sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz
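
For reference, the three single-bench timings above can be turned into speedup factors with a few lines of Rust. This is only a sketch of the arithmetic; the speedup helper is a hypothetical name, and the inputs are the ns/iter figures copied from the runs posted above.

```rust
/// Ratio of sequential to parallel time; values above 1.0 mean the parallel run was faster.
fn speedup(sequential_ns: u64, parallel_ns: u64) -> f64 {
    sequential_ns as f64 / parallel_ns as f64
}

fn main() {
    // ns/iter for bench_zk_proof_challenge_1024, copied from the three runs above.
    let iter_ns = 289_669_699;
    let par_one_test_thread_ns = 142_414_039;
    let par_default_ns = 134_116_663;

    println!(
        "par_iter(), RUST_TEST_THREADS=1: {:.2}x",
        speedup(iter_ns, par_one_test_thread_ns)
    );
    println!(
        "par_iter(), default:             {:.2}x",
        speedup(iter_ns, par_default_ns)
    );
}
```

Both par_iter() runs come out at roughly twice the speed of iter(). The gap between them (about 8.3 million ns) is smaller than the reported +/- spread of either run, so on these numbers the two settings are not distinguishable.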

Merged: #17.