beling/bsuccinct-rs

Building `fmph::keyset::DynamicKeySet` from a `rayon::iter::ParallelIterator` instead of `Iterator`

progval opened this issue · 7 comments

Hi,

I'm trying to build a DynamicKeySet (actually, a CachedKeySet) from a rayon::iter::ParallelIterator instead of an Iterator. This seems like a natural fit, given that my own input comes as a bunch of iterators and fmph uses them internally as well.

I attempted to do this by copying the DynamicKeySet structure and crudely replacing every : Iterator bound with a : ParallelIterator. This works for the most part; however, I am stuck implementing for_each_key and into_vec, as their predicates lack the Send and Sync bounds necessary to call rayon's .map().
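
Roughly, my copy looks like this (a simplified sketch, not the crate's actual code; DynamicParKeySet is my own name for it):

```rust
use rayon::iter::ParallelIterator;

// Sketch of the attempt: every `Iterator` bound of DynamicKeySet
// crudely replaced by `ParallelIterator`.
struct DynamicParKeySet<GetKeyIter> {
    keys: GetKeyIter,
}

impl<KeyIter, GetKeyIter> DynamicParKeySet<GetKeyIter>
where
    KeyIter: ParallelIterator,
    GetKeyIter: Fn() -> KeyIter,
{
    // Here is where I'm stuck: rayon's for_each/map require the callback
    // to be Send + Sync, but KeySet::for_each_key only receives a plain Fn,
    // so these extra bounds leak into the KeySet trait.
    fn for_each_key<F>(&self, f: F)
    where
        F: Fn(KeyIter::Item) + Send + Sync, // bounds forced by rayon
    {
        (self.keys)().for_each(f);
    }
}
```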

And adding these bounds to the KeySet definition seems excessive, given that not all users need them.

Any tips?

Thanks!

beling commented

Hi,

The F function given to for_each_key is more efficient in its single-threaded version, and that version is always used when rayon has only 1 thread (which can only be checked at runtime).
The solution is a KeySet that has two functions: the first creates an Iterator and the second a rayon::iter::ParallelIterator.
I will try to implement something like this soon.
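
Just to sketch the idea (names are placeholders, not the final API):

```rust
use rayon::iter::ParallelIterator;

// One trait that can hand out both kinds of iterators, so the
// single-threaded path keeps a callback free of Send + Sync bounds.
trait GetBothIterators {
    type Item: Send;
    type Iter: Iterator<Item = Self::Item>;
    type ParIter: ParallelIterator<Item = Self::Item>;

    // Used by the single-threaded code path (and when rayon has 1 thread).
    fn iter(&self) -> Self::Iter;
    // Used by the multi-threaded code path.
    fn par_iter(&self) -> Self::ParIter;
}
```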

beling commented

Please check if the code I have just added to the repository works and is useful to you.

See: CachedKeySet::dynamic_par(keys: impl GetParallelIterator, clone_threshold: usize) and DynamicParKeySet.
Note that GetParallelIterator is implemented for a pair of functions:
(F: Fn() -> impl Iterator, PF: Fn() -> impl ParallelIterator).
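
For example, usage should look roughly like this (a sketch, assuming the pair of closures implements GetParallelIterator as described above; the key source, import path, and clone_threshold value are placeholders):

```rust
use ph::fmph::keyset::CachedKeySet;
use rayon::prelude::*;

fn main() {
    // Placeholder key source; in practice this would be your real input.
    let keys: Vec<u64> = (0..1_000_000).collect();

    // The pair of closures provides the key set: the first yields a
    // sequential iterator over the keys, the second a parallel one.
    let key_set = CachedKeySet::dynamic_par(
        (|| keys.iter().copied(), || keys.par_iter().copied()),
        usize::MAX, // clone_threshold (placeholder value)
    );
    let _ = key_set; // pass it to the fmph function constructor
}
```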

progval commented

This works fine on my test dataset (23 keys), but for a larger one (602,041 keys) it seems to call par_iter an infinite number of times if it doesn't return None, despite my key order being consistent. I'll give you some minimal code to reproduce the issue as soon as I can.

beling commented

Thanks. I've just fixed a bug and added a test, which suggests that everything should work now.

progval commented

Amazing, thanks! It works, and I'm getting a 100× speedup on 100 threads.

I need to fix my code, because a few duplicates in my "real life" dataset make the computation crash (7 duplicates over 27 billion keys :( ), but it's looking very promising.

beling commented

Good to hear! I will publish ph 0.8 to crates.io soon.
I've merged DynamicParKeySet with DynamicKeySet and changed some names (e.g. dynamic_par -> dynamic), see:
01f3b12

beling commented

Note that you can use the try_with_conf_stats_or_partial constructor of (GO)Function to find duplicates in expected linear time, in case you don't have enough memory for other methods (like using a simple HashSet).
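
For comparison, the simple HashSet baseline can be sketched like this (also expected linear time, but it keeps every distinct key in memory; find_duplicates is a hypothetical helper, not part of the crate):

```rust
use std::collections::HashSet;

// Baseline duplicate finder: expected linear time,
// but memory grows with the number of distinct keys.
fn find_duplicates<K: std::hash::Hash + Eq + Clone>(
    keys: impl IntoIterator<Item = K>,
) -> Vec<K> {
    let mut seen = HashSet::new();
    let mut duplicates = Vec::new();
    for key in keys {
        // HashSet::insert returns false when the key was already present.
        if !seen.insert(key.clone()) {
            duplicates.push(key);
        }
    }
    duplicates
}
```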