curso-r/treesnip

parallel processing

Athospd opened this issue · 5 comments

Both catboost and lightgbm have built-in parallel processing support, and so does {tidymodels}.
Figure out how to implement it correctly.

Maybe this is relevant https://tidymodels.github.io/model-implementation-principles/parallel-processing.html
Particularly:

> Computational code in other languages (e.g. Cpp, etc.) should pull from R’s random number streams so that setting the seed prior to invoking these routines ensures reproducibility.

So for catboost we have to pull from R's random number stream and pass the result through to `random_seed`; for lightgbm it is not so clear.
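A minimal sketch of what that could look like using catboost's R API directly (`random_seed` is a documented catboost parameter; how treesnip would wire this up internally is an assumption, and the data/params here are just illustrative):

```r
library(catboost)

# Draw the engine seed from R's RNG stream, so that a set.seed()
# call upstream makes the catboost fit reproducible.
set.seed(123)
engine_seed <- sample.int(.Machine$integer.max, 1L)

pool <- catboost.load_pool(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)

model <- catboost.train(
  learn_pool = pool,
  params = list(
    loss_function = "RMSE",
    iterations    = 10,
    random_seed   = engine_seed,   # seeded from R, not from C++
    logging_level = "Silent"
  )
)
```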

catboost has a `thread_count` argument that you can pass in params, e.g. `params = list(thread_count = 4L)`; it uses -1 (meaning all available threads) by default. I think that means if you manually try to run training in parallel it slows down, because you end up training multiple models that are all hogging the same threads.
For lightgbm there is a `num_threads` argument you can pass to train and cv; it uses 0 by default (0 means OpenMP's default number of threads).
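For illustration, a sketch of capping each engine's thread count via its R API directly (`thread_count` and `num_threads` are the documented parameter names; the data and remaining params are just placeholders):

```r
library(catboost)
library(lightgbm)

x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

# catboost: limit the engine to 4 threads instead of the default -1 (all cores).
cb_model <- catboost.train(
  learn_pool = catboost.load_pool(data = x, label = y),
  params = list(
    loss_function = "RMSE",
    iterations    = 10,
    thread_count  = 4L,
    logging_level = "Silent"
  )
)

# lightgbm: same idea via num_threads (default 0 = OpenMP's default).
lgb_model <- lgb.train(
  params  = list(objective = "regression", num_threads = 4L),
  data    = lgb.Dataset(data = x, label = y),
  nrounds = 10
)
```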

Yes, I think in general we shouldn't use tune's parallel options, as lightgbm and catboost already scale well across multiple cores by default. We probably want to document this somewhere.
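If we do document it, the recommendation could look something like this sketch (it assumes `set_engine()` passes `num_threads` through to lightgbm via treesnip's engine registration; the key point is to *not* register a foreach parallel backend, so `tune_grid()` runs resamples sequentially while each fit uses the engine's threads):

```r
library(tidymodels)
library(treesnip)

# Let the engine parallelize within each fit. Do NOT register a
# parallel backend (e.g. doParallel::registerDoParallel()), or the
# tune workers' lightgbm fits will compete for the same cores.
spec <- boost_tree(trees = tune()) %>%
  set_engine("lightgbm", num_threads = 4L) %>%  # engine-level threads (assumed pass-through)
  set_mode("regression")

folds <- vfold_cv(mtcars, v = 5)

res <- tune_grid(
  spec,
  mpg ~ .,
  resamples = folds,
  grid = 5
)
```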

Worth noting that the tidymodels ecosystem handles the resampling/tuning steps through the {tune} and {rsample} toolkit, so we would never use the `lgb.cv()` function by design.