statistikat/VIM

transform xgboostImpute and rangerImpute into a generic function with methods for formula and data

alexkowa opened this issue · 4 comments

@GregorDeCillia I added a new function, xgboostImpute, which is very similar to your rangerImpute function. At first sight, it performs very well. Both functions take a formula as their first argument.
To make them more pipe-friendly and aligned with the other imputation functions, should we
A) simply change the order of the parameters so that the first argument is the data set (possibly breaking existing user code), or
B) create new generic functions that dispatch on the class of the first argument?

@GregorDeCillia and also @matthias-da @JohannesGuss what do you think?

@alexkowa looks good! I just made some slight changes to rangerImpute() for the case of factors. Now, when a factor is imputed, the imputed value is drawn randomly using the predicted probabilities from the model output.
Maybe this should also be adopted in xgboostImpute().
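For illustration, a rough base R sketch of drawing a factor value from predicted class probabilities. The helper name sample_from_probs and the layout of the probability matrix (one row per observation, one column per factor level) are assumptions for this example, not the actual rangerImpute() code.

```r
# Hypothetical helper, not part of VIM: draw one class label per row
# from a matrix of predicted class probabilities
# (rows = observations, columns = factor levels).
sample_from_probs <- function(prob, levels = colnames(prob)) {
  stopifnot(is.matrix(prob), length(levels) == ncol(prob))
  # One draw per row, weighted by that row's predicted probabilities.
  drawn <- apply(prob, 1, function(p) sample(levels, size = 1, prob = p))
  factor(drawn, levels = levels)
}

set.seed(1)
prob <- matrix(c(0.9, 0.1, 0.0,
                 0.2, 0.3, 0.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("a", "b", "c")))
sample_from_probs(prob)
```

Compared with always taking the most probable class, this keeps the imputed factor from collapsing onto the majority level.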

I would personally opt for version A.

Good idea to sample. Yes, let's do that for XGBoost too if someone has time to implement it.

Not sure if introducing breaking changes is worth it, although I totally agree that having the dataset as the first argument makes more sense, especially for usage with the native pipe introduced in R 4.1. An S3 dispatch should allow this change without breaking old code.

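rangerImpute <- function(x, ...) UseMethod("rangerImpute")  # generic: dispatch on class of x
rangerImpute.formula <- function(x, ...) {}     # old call style: formula first
rangerImpute.data.frame <- function(x, ...) {}  # new call style: data first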
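A minimal runnable sketch of how such a dispatch could keep both call styles working. Everything here is illustrative: imputeDemo and impute_worker are made-up names, and lm() stands in for the ranger/xgboost fit.

```r
# Hypothetical generic; the name and arguments are illustrative only.
imputeDemo <- function(x, ...) UseMethod("imputeDemo")

# Old call style: formula first, data second.
imputeDemo.formula <- function(x, data, ...) {
  impute_worker(formula = x, data = data, ...)
}

# New call style: data first, so it also works as df |> imputeDemo(y ~ x).
imputeDemo.data.frame <- function(x, formula, ...) {
  impute_worker(formula = formula, data = x, ...)
}

# Shared worker. Assumption: lm() as a stand-in for the real model.
impute_worker <- function(formula, data, ...) {
  target <- all.vars(formula)[1]          # left-hand side variable
  miss <- is.na(data[[target]])
  fit <- lm(formula, data = data[!miss, , drop = FALSE])
  data[[target]][miss] <- predict(fit, newdata = data[miss, , drop = FALSE])
  data
}

df <- data.frame(y = c(1, 2, NA, 4), x = c(1, 2, 3, 4))
old_style <- imputeDemo(y ~ x, df)   # dispatches to the formula method
new_style <- imputeDemo(df, y ~ x)   # dispatches to the data.frame method
identical(old_style, new_style)
```

Both methods only reorder arguments and hand over to one worker, so the imputation logic lives in a single place.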

At the end of the day, it would also be good to compare both rangerImpute and xgboostImpute with missRanger and mixgb (although both can be used in a chain to impute multivariate missingness), not only on precision measures (comparing imputed and original data values in a simulation) but also on coverage rates and root mean squared errors of estimators. I can do this when there is a bit of time for it. It might give an idea of whether imputation uncertainty and model uncertainty are treated well.

One argument I often hear against almost all imputation methods in VIM is that we only account for imputation uncertainty (drawing from predictive distributions; one can also think of PMM and midastouch) but not for model uncertainty, e.g. with a bootstrap, which would be very simple to implement (at least as an option).
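To make the distinction concrete, a rough sketch of a single imputation that covers both sources. The helper impute_once and its boot switch are hypothetical, and lm() with normal residual noise is only a stand-in for whatever model and predictive draw a VIM function would use.

```r
# Hypothetical helper, not VIM code: one imputation of the response,
# optionally refitting on a bootstrap resample of the observed rows.
impute_once <- function(data, formula, boot = TRUE) {
  target <- all.vars(formula)[1]
  miss <- is.na(data[[target]])
  obs <- data[!miss, , drop = FALSE]
  if (boot) {
    # Model uncertainty: the fit changes with each resample.
    obs <- obs[sample(nrow(obs), replace = TRUE), , drop = FALSE]
  }
  fit <- lm(formula, data = obs)
  pred <- predict(fit, newdata = data[miss, , drop = FALSE])
  # Imputation uncertainty: add noise with the residual standard deviation.
  data[[target]][miss] <- pred + rnorm(sum(miss), sd = sigma(fit))
  data
}

set.seed(42)
df <- data.frame(x = rnorm(50))
df$y <- 1 + 2 * df$x + rnorm(50)
df$y[1:10] <- NA

# Five imputed data sets, each based on its own bootstrap fit.
imputed <- lapply(1:5, function(i) impute_once(df, y ~ x))
```

With boot = FALSE every imputed data set shares the same fitted model, which is the situation the criticism refers to.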

I recently implemented PMM and midastouch in function imputeRobust (just committed).
There is a function imputeRobustChain - this is very unfinished, ignore it.
There is a function imputeRobust that has a lot of enhancements compared to irmi:

  • complex formulas can be provided, e.g. log(income) ~ I(age^2) + region * whatever for each variable with missing values.
  • PMM and midastouch are available, PMM as the default
  • model uncertainty is considered by different versions of robust bootstraps
  • and some more.

What is missing is testing and code improvement (almost no checks are implemented). It's currently a working solution and, of course, there has been no time to do this for months :-( If somebody is interested...?

So, one might use the PMM and midastouch from imputeRobust in rangerImpute and xgboostImpute?
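As a rough idea of what that could look like, a generic predictive mean matching step in base R that works on predictions from any model. This is a textbook-style sketch, not the imputeRobust implementation; pmm_draw and the donor count k are assumptions.

```r
# Hypothetical helper: predictive mean matching.
# pred_obs: model predictions for rows where y is observed
# y_obs:    the observed y values for those rows
# pred_mis: model predictions for rows where y is missing
# k:        number of closest donors to sample from
pmm_draw <- function(pred_obs, y_obs, pred_mis, k = 5) {
  stopifnot(length(pred_obs) == length(y_obs))
  vapply(pred_mis, function(p) {
    # Donors are the observed rows whose predictions are closest to p.
    donors <- order(abs(pred_obs - p))[seq_len(min(k, length(y_obs)))]
    # Impute with the observed value of one randomly chosen donor.
    y_obs[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
}

set.seed(7)
df <- data.frame(x = 1:20)
df$y <- 3 * df$x + rnorm(20)
df$y[c(4, 11, 17)] <- NA
miss <- is.na(df$y)

fit <- lm(y ~ x, data = df[!miss, ])   # stand-in for ranger/xgboost
df$y[miss] <- pmm_draw(
  pred_obs = predict(fit, newdata = df[!miss, ]),
  y_obs    = df$y[!miss],
  pred_mis = predict(fit, newdata = df[miss, ])
)
```

Because only predictions are needed, the same step could follow a random forest or boosted model, and imputed values are always values that actually occur in the data.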