Building `fmph::keyset::DynamicKeySet` from a `rayon::iter::ParallelIterator` instead of `Iterator`
progval opened this issue · 7 comments
Hi,
I'm trying to build a `DynamicKeySet` (actually, a `CachedKeySet`) from a `rayon::iter::ParallelIterator` instead of an `Iterator`. I feel this makes sense given that I have a bunch of iterators as my own input, and `fmph` uses them internally as well.
I attempted to do this by copying the `DynamicKeySet` structure and crudely replacing every `: Iterator` bound with a `: ParallelIterator` bound. This works for the most part; however, I am stuck implementing `for_each_key` and `into_vec`, as their predicates lack the `Send` and `Sync` bounds necessary to call rayon's `.map()`. And adding these bounds to the `KeySet` definition seems excessive, given that not all users need them.
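Here is a minimal reduction of where I am stuck (simplified, hypothetical names; not the actual `fmph::keyset::KeySet` trait):

```rust
use rayon::prelude::*;

// Simplified stand-in for the sequential KeySet trait: `for_each_key`
// accepts an arbitrary closure, with no Send/Sync requirement on `F`.
trait MyKeySet<K: Send + Sync> {
    fn for_each_key<F>(&self, f: F)
    where
        F: FnMut(&K);
}

struct VecKeySet<K>(Vec<K>);

impl<K: Send + Sync> MyKeySet<K> for VecKeySet<K> {
    fn for_each_key<F>(&self, f: F)
    where
        F: FnMut(&K),
    {
        // Does not compile: rayon's `for_each` needs `F: Fn(&K) + Send + Sync`,
        // which this trait signature cannot promise:
        //self.0.par_iter().for_each(f);
        self.0.iter().for_each(f); // only the sequential form type-checks
    }
}
```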
Any tips?
Thanks!
Hi,
The `F` function in `for_each_key` is more efficient in the single-threaded version, and such a version is always used when rayon has only 1 thread (which can only be checked at runtime). The solution is a `KeySet` which will have two functions: the first creating an `Iterator` and the second a `rayon::iter::ParallelIterator`. I will try to implement something like this soon.
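Roughly, the idea is something like this (a hypothetical sketch, not the final API):

```rust
use rayon::iter::ParallelIterator;

// Hypothetical sketch of the idea: one key source, two iterator factories.
// The single-threaded path keeps the cheaper, bound-free callback, while the
// parallel path is taken only when rayon actually has more than one thread.
trait DualIterKeySet {
    type Key: Send + Sync;
    type Iter: Iterator<Item = Self::Key>;
    type ParIter: ParallelIterator<Item = Self::Key>;

    fn iter(&self) -> Self::Iter;
    fn par_iter(&self) -> Self::ParIter;
}
```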
Please check whether the code I have just added to the repository works and is useful to you. See `CachedKeySet::dynamic_par(keys: impl GetParallelIterator, clone_threshold: usize)` and `DynamicParKeySet`.
Note that `GetParallelIterator` is implemented by a pair of functions: `(F: Fn() -> impl Iterator, PF: Fn() -> impl ParallelIterator)`.
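For example, usage could look roughly like this (the exact item types and bounds accepted, and the `clone_threshold` semantics, are assumptions here; please check the docs):

```rust
use rayon::prelude::*;
use ph::fmph::keyset::CachedKeySet;

fn main() {
    let keys: Vec<u64> = (0..100_000).collect();
    // The pair of closures implements GetParallelIterator, as described above.
    let key_set = CachedKeySet::dynamic_par(
        (
            || keys.iter().copied(),     // F:  Fn() -> impl Iterator
            || keys.par_iter().copied(), // PF: Fn() -> impl ParallelIterator
        ),
        usize::MAX, // clone_threshold (assumed semantics: when to cache keys in a Vec)
    );
    let _ = key_set; // feed this into an fmph function constructor
}
```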
This works fine on my test dataset (23 keys), but for a larger one (602,041 keys) it seems to call `par_iter` endlessly if it doesn't return `None`, despite my key order being consistent. I'll give you some minimal code to reproduce the issue as soon as I can.
Thanks. I've just fixed a bug and added a test which suggests that everything should work now.
Amazing, thanks! It works and I'm getting a 100× speedup on 100 threads.
I still need to fix my code because a few duplicate keys in my real-life dataset make the computation crash (7 duplicates across 27 billion keys :( ), but it's looking very promising.
Good to hear! I will publish `ph` 0.8 to crates.io soon.
I've merged `DynamicParKeySet` with `DynamicKeySet` and changed some names (e.g. `dynamic_par` -> `dynamic`); see 01f3b12.
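With the rename, the earlier usage sketch becomes (same assumptions as before):

```rust
use rayon::prelude::*;
use ph::fmph::keyset::CachedKeySet;

fn main() {
    let keys: Vec<u64> = (0..100_000).collect();
    // Same pair of closures as before, with the renamed constructor:
    let key_set = CachedKeySet::dynamic(
        (|| keys.iter().copied(), || keys.par_iter().copied()),
        usize::MAX, // clone_threshold
    );
    let _ = key_set;
}
```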
Note that you can use the `try_with_conf_stats_or_partial` constructor of `Function`/`GOFunction` to find duplicates in expected linear time if you have no memory for other methods (like using a simple `HashSet`).
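For reference, the simple `HashSet` method mentioned above looks like this (fine whenever the keys fit in memory):

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Collects the duplicated keys from `keys` with a HashSet: simple,
/// but it needs memory proportional to the number of distinct keys.
fn find_duplicates<K: Hash + Eq + Clone>(keys: impl IntoIterator<Item = K>) -> Vec<K> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for key in keys {
        if !seen.insert(key.clone()) {
            dups.push(key); // `insert` returned false: key was already present
        }
    }
    dups
}

fn main() {
    assert_eq!(find_duplicates([1, 2, 3, 2, 1]), vec![2, 1]);
}
```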