Forest building on parallel threads is not a good idea
dhanak opened this issue · 2 comments
I have been using DecisionTree to build regression forests for a while. Recently, I started running my data loading and preprocessing tasks on parallel threads to speed things up, and noticed that forest building automatically started using the available threads as well. All the better, right? Not exactly. My happiness was short-lived: I realized that my experiments had become nondeterministic, even though they were perfectly deterministic on a single thread with a fixed random seed set at startup. I quickly figured out that the parallel threads in the forest-building logic all draw random numbers from the same generator, and with multiple threads the order of those draws is no longer deterministic:
    forest = Vector{LeafOrNode{S, T}}(undef, n_trees)
    Threads.@threads for i in 1:n_trees
        inds = rand(rngs, 1:t_samples, n_samples)
        forest[i] = build_tree(
            labels[inds],
            features[inds,:],
            n_subfeatures,
            max_depth,
            min_samples_leaf,
            min_samples_split,
            min_purity_increase,
            rng = rngs)
    end
This is definitely not good. I would very much prefer to have at least an option to disable using threads for forest building.
Thanks for catching this!
You can now pass a seed manually to get reproducible multi-threaded random forests:
build_forest(labels, features ; rng = 3)
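For readers wondering how an integer seed can make a threaded loop reproducible, here is a minimal sketch of one common approach: give each tree its own generator, seeded from the base seed plus the tree index, so that no generator is shared between threads. This is only an illustration and not necessarily how DecisionTree implements the fix. The helper name `bootstrap_indices` and the `seed + i` scheme are assumptions made for the example; only the index-drawing step of the loop is reproduced.

```julia
using Random

# Hypothetical helper (not part of DecisionTree): draws the bootstrap
# indices for every tree inside a threaded loop.
# Assumption: each tree i gets its own generator seeded with `seed + i`,
# so the result does not depend on the order in which threads run.
function bootstrap_indices(seed::Integer, n_trees::Integer,
                           t_samples::Integer, n_samples::Integer)
    inds = Vector{Vector{Int}}(undef, n_trees)
    Threads.@threads for i in 1:n_trees
        rng = MersenneTwister(seed + i)   # per-tree RNG, never shared
        inds[i] = rand(rng, 1:t_samples, n_samples)
    end
    return inds
end

# Two runs with the same seed produce identical samples,
# regardless of how many threads Julia was started with.
a = bootstrap_indices(3, 8, 100, 10)
b = bootstrap_indices(3, 8, 100, 10)
println(a == b)
```

With a single shared generator, by contrast, each thread's draws depend on which thread reaches `rand` first, which is the nondeterminism described at the top of this issue.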
Thanks for the fix! I tried it, and it works well!