stekhoven/missForest

prediction capability

Some users have pointed out that it would be helpful to train a missForest on one data set (including missing values) and then reuse it to impute (i.e. predict) the missing values in other data sets. The obvious benefit would be to save the computation time of relearning the missForest anew on each similar data set.

There are the following things to consider:

  • the forests from each iteration step have to be saved;
  • within each step, the forest for each variable has to be saved;
  • prediction on new data will have to iterate through all saved steps in order;
  • it is (probably) not enough to simply use the final forest, since it was trained on input values already adapted by the iterative process of missForest (this has to be tested);
  • all of this has to be saved in an efficient way.
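
To make the considerations above concrete, here is a minimal sketch of the "save every step, replay on new data" idea in plain Python. It is not missForest's code, and several things are assumptions made for illustration: the function names (`fit`, `predict`, `knn_predict`, `column_means`) are hypothetical, a small k-nearest-neighbour learner stands in for the random forest, missing values are encoded as `None`, all columns are numeric, and a fixed number of iterations replaces missForest's stopping criterion.

```python
import math


def knn_predict(train_x, train_y, row, k):
    """Average the targets of the k training rows closest to `row`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], row))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def column_means(data):
    """Mean of the observed (non-None) values in each column."""
    means = []
    for j in range(len(data[0])):
        observed = [r[j] for r in data if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def fit(data, n_iter=3, k=3):
    """Iteratively impute `data` and keep one model per variable per step."""
    init = column_means(data)
    # Initial guess: replace every missing value by its column mean.
    ximp = [[init[j] if v is None else v for j, v in enumerate(r)] for r in data]
    steps = []
    for _ in range(n_iter):
        models = {}
        for j in range(len(init)):
            obs = [i for i, r in enumerate(data) if r[j] is not None]
            # Snapshot of the predictors as they look at this point of the
            # iteration; this is the state that has to be saved per step.
            train_x = [ximp[i][:j] + ximp[i][j + 1:] for i in obs]
            train_y = [data[i][j] for i in obs]
            models[j] = (train_x, train_y)
            for i, r in enumerate(data):
                if r[j] is None:
                    others = ximp[i][:j] + ximp[i][j + 1:]
                    ximp[i][j] = knn_predict(train_x, train_y, others, k)
        steps.append(models)
    return {"init": init, "steps": steps, "k": k}, ximp


def predict(fitted, new_data):
    """Impute `new_data` by replaying every saved step in order."""
    init, k = fitted["init"], fitted["k"]
    ximp = [[init[j] if v is None else v for j, v in enumerate(r)] for r in new_data]
    for models in fitted["steps"]:
        for j, (train_x, train_y) in models.items():
            for i, r in enumerate(new_data):
                if r[j] is None:
                    others = ximp[i][:j] + ximp[i][j + 1:]
                    ximp[i][j] = knn_predict(train_x, train_y, others, k)
    return ximp


train = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None], [None, 10.0]]
fitted, train_imputed = fit(train, n_iter=2, k=1)
new_imputed = predict(fitted, [[2.0, None], [None, 4.0]])
```

In this sketch the stored object grows with the number of iterations times the number of variables, which is the storage concern raised in the last bullet. Whether replaying only the final step would give equally good imputations is exactly the open question from the list; the sketch does not answer it.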

Elena Albu from KU Leuven reached out to me on this topic. She has published a prediction version of missForest, missForestPredict, which is available on CRAN. The vignette is here: https://cran.r-project.org/web/packages/missForestPredict/vignettes/missForestPredict_usage.html

While this is indeed a solution, and we have run several simulations with the code, the object to be stored can become prohibitively large. Make sure you have enough memory available when using it on large data sets with many missing values.

Great work, Elena Albu!