bips-hb/cpi

`resampling = "oob"` fails when an observation is never out-of-bag

mikoontz opened this issue · 1 comment

Hello! Thanks very much for developing this methodology and package for calculating variable importance.

I'm using it to try to make inference from a random forest model fit to large-ish data and with collinearity in the predictors. I've found the CPI approach to be more satisfying than arbitrarily dropping correlated predictors below some threshold value of correlation.

I was exploring the use of the "oob" method for computing loss, and kept getting the following error:

Error in cpi_fun(j) : 
  task 1 failed - "missing value where TRUE/FALSE needed"

I tracked it down to cases where an observation is never out-of-bag in any tree (i.e., it is in-bag in every tree). This can happen when observations are weighted or when there aren't very many trees. You can reproduce the error using this example from the package help file with the num.trees argument set to something low:

library(mlr3)          # as_task_regr(), lrn()
library(mlr3learners)  # provides the "regr.ranger" learner

mytask <- as_task_regr(iris, target = "Petal.Length")
cpi::cpi(task = mytask, learner = lrn("regr.ranger", keep.inbag = TRUE, num.trees = 10, seed = 2), 
    resampling = "oob", 
    knockoff_fun = seqknockoff::knockoffs_seq)
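
As a rough check on why a low num.trees triggers this, here is a back-of-the-envelope sketch. It assumes each tree draws n observations with replacement and with equal weights (a standard bootstrap); the variable names are illustrative and not part of cpi or ranger.

```r
# Assumption: each tree draws n observations with replacement, equal weights.
n <- 150          # number of rows in iris
num_trees <- 10   # as in the example above

# Probability that a given observation is in-bag for a single tree
p_inbag <- 1 - (1 - 1 / n)^n

# Probability that it is in-bag for every tree, i.e. never out-of-bag
p_never_oob <- p_inbag^num_trees

# Expected number of observations that are never out-of-bag
expected_never_oob <- n * p_never_oob

print(c(p_inbag = p_inbag,
        p_never_oob = p_never_oob,
        expected_never_oob = expected_never_oob))
```

Under these assumptions the per-observation probability is about 0.01 with 10 trees, so with 150 rows at least one such observation is quite likely. The probability shrinks geometrically as num.trees grows.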

In the weeds

The problem starts with the creation of the oob_idx object here:

oob_idx <- ifelse(simplify2array(mod$model$inbag.counts) == 0, TRUE, NA)

If an observation is never out of bag in any of the trees, then its entire row of oob_idx is NA, and the rowMeans() calculation for that row is NaN even with na.rm = TRUE in this line:

y_hat <- rowMeans(oob_idx * preds, na.rm = TRUE)
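
The NaN can be seen on toy data. This is a minimal sketch: the inbag and preds matrices below are made up for illustration, and only the oob_idx and y_hat lines mirror the ones quoted above.

```r
# Toy in-bag counts: 3 observations (rows) x 4 trees (columns).
# Observation 3 is in-bag in every tree, so it is never out-of-bag.
inbag <- matrix(c(0, 1, 2, 0,
                  1, 0, 0, 1,
                  1, 2, 1, 1), nrow = 3, byrow = TRUE)

# Toy per-tree predictions with the same shape.
preds <- matrix(c(1.0, 1.2, 1.4, 2.0,
                  3.0, 4.0, 5.0, 6.0,
                  7.0, 7.0, 7.0, 7.0), nrow = 3, byrow = TRUE)

# TRUE where the observation is out-of-bag, NA otherwise
oob_idx <- ifelse(inbag == 0, TRUE, NA)

# Average only the out-of-bag predictions for each observation.
# Row 3 has no non-NA entries, so its mean is NaN despite na.rm = TRUE.
y_hat <- rowMeans(oob_idx * preds, na.rm = TRUE)
print(y_hat)
```

With na.rm = TRUE, a row with no remaining values is the mean of an empty set, which R reports as NaN rather than dropping the row.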

Then there are NaN values in the predictions, which puts NA in the loss returned by compute_loss(), and these all carry through as NA when calculating dif (cpi/R/cpi.R, line 272 in 42a7e0b):

dif <- err_reduced - err_full

Then the cpi, calculated as mean(dif), returns NA, and the if (cpi == 0) check errors out (cpi/R/cpi.R, line 292 in 42a7e0b):

if (cpi == 0) {
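
The last step can be reproduced in isolation. This is a small sketch with a made-up dif vector; cpi_value and msg are illustrative names, not objects from the package.

```r
# Made-up loss differences; the NA stands in for the never-out-of-bag observation.
dif <- c(0.2, -0.1, NA)

# Without na.rm = TRUE, a single NA makes the mean NA.
cpi_value <- mean(dif)

# if () needs a single TRUE or FALSE; NA == 0 evaluates to NA, so it throws.
msg <- tryCatch({
  if (cpi_value == 0) "zero" else "nonzero"
}, error = function(e) conditionMessage(e))

print(cpi_value)
print(msg)
```

In an English locale the captured message should match the "missing value where TRUE/FALSE needed" text from the error at the top of this issue.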

I think you can get around this by including na.rm = TRUE in the cpi <- mean(dif) call, but maybe there's a deeper philosophical issue where it wouldn't be prudent to drop the "only ever in bag" observations from the CPI calculation? Particularly if we get to that point because of weighting that observation to be favored as in-bag? I suspect this might be the case, in which case perhaps the best change would be a more informative error message?

cpi <- mean(dif, na.rm = TRUE)
se <- sd(dif, na.rm = TRUE) / sqrt(sum(!is.na(dif)))

Thanks for using the package! This is a helpful and thorough comment. My inclination is to say that the solution here is a more informative error message, since OOB is just not the right resampling strategy if an observation is never out-of-bag. In-sample risk estimators are supported in cpi, but consistency is only guaranteed for out-of-sample estimators. (Adding uninformative covariates will never hurt the in-sample performance of a linear regressor, for example, but will increase generalization error if the features are assigned nonzero weight.) We'll be sure to make the change.
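
One way such a check might look is sketched below. This is only an illustration of the idea, not the change made in cpi: check_oob_coverage is a hypothetical helper, and the input is assumed to have the same shape as the inbag.counts object used above (a list with one vector of in-bag counts per tree).

```r
# Hypothetical helper, not part of cpi: stop early with a clear message
# if any observation is in-bag in every tree.
check_oob_coverage <- function(inbag_counts) {
  inbag <- simplify2array(inbag_counts)          # observations x trees
  never_oob <- which(rowSums(inbag == 0) == 0)   # rows with no out-of-bag tree
  if (length(never_oob) > 0) {
    stop("resampling = \"oob\" requires every observation to be out-of-bag ",
         "in at least one tree; ", length(never_oob),
         " observation(s) never are (e.g., row ", never_oob[1], "). ",
         "Consider more trees or a different resampling strategy.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy input: 3 trees, 3 observations. Observation 3 is in-bag in every tree.
toy_counts <- list(c(0, 1, 1), c(1, 0, 2), c(2, 0, 1))
result <- tryCatch(check_oob_coverage(toy_counts),
                   error = function(e) conditionMessage(e))
print(result)
```

Failing before the loss is computed keeps the never-out-of-bag observations from being dropped silently, which fits the concern about weighted observations raised above.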