ModelOriented/kernelshap

Optional feature selection


This confused me for a bit because I assumed the L1 penalty was part of the optimization itself (i.e., just another constraint in Eq. 6 of the Covert & Lee paper). But a closer look at the shap code reveals that this is not the case: they just fit a lasso for feature selection, then remove some column(s) from Z and run the regular pipeline.

We could implement this with glmnet. We'd need an l1 argument, with possible values NULL (the default, no regularization), "aic", "bic" (for adaptive feature selection), or some integer less than p to set a target number of variables. Then within kernelshap_one, we'd have something like the following (after Z and vz are defined):

# Optional feature selection
if (!is.null(l1)) {
  # Fit the full lasso path to the weighted regression problem
  fit <- glmnet(x = Z, y = vz - v0_ext, weights = w, intercept = FALSE)
  if (l1 %in% c("aic", "bic")) {
    # Compute information criteria for adaptive feature selection
    y_hat <- predict.glmnet(fit, newx = Z, s = fit$lambda)
    eps <- y_hat - (vz - v0_ext)
    rmse <- sqrt(colMeans(eps^2))
    # Gaussian log-likelihood of each model along the lambda path
    ll <- sapply(seq_along(fit$lambda), function(l) {
      sum(dnorm(eps[, l], sd = rmse[l], log = TRUE))
    })
    n <- nrow(Z)
    aic <- 2 * fit$df - 2 * ll
    bic <- log(n) * fit$df - 2 * ll
    k <- if (l1 == "aic") which.min(aic) else which.min(bic)
    keep <- predict.glmnet(fit, s = fit$lambda[k], type = "nonzero")[[1L]]
  } else {
    # Or select a fixed number of features. This can be tricky when
    # no value of lambda delivers precisely the target number of nonzeros
    if (l1 %in% fit$df) {
      k <- max(which(fit$df == l1))
      keep <- predict.glmnet(fit, s = fit$lambda[k], type = "nonzero")[[1L]]
    } else {
      k <- min(which(fit$df >= l1))
      # Drop the intercept row so indices line up with the columns of Z
      beta <- abs(as.numeric(coef.glmnet(fit, s = fit$lambda[k])))[-1L]
      keep <- order(beta, decreasing = TRUE)[seq_len(l1)]
    }
  }
  Z <- Z[, keep, drop = FALSE]
}

This updated Z is then used to define A and b. Of course, we'd need to set the betas of the dropped features to 0 prior to output.
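For that last step, a minimal sketch, assuming beta_sel holds the coefficients estimated on the reduced Z and p is the total number of features (both names are placeholders, not kernelshap internals):

# Pad the selected coefficients back to a full-length vector,
# with zeros for the dropped features (beta_sel, p: placeholder names)
beta_full <- numeric(p)
beta_full[keep] <- beta_sel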

What do you think? I can create a pull request if you like.

Thanks for digging into this L1 aspect. I am not 100% surprised that the implementation is different from the 2017 publication (as with several things related to SHAP...). Do you understand how they do the constrained optimization in the SHAP package in Python (I mean without the L1 aspect, just the basic logic)?

I think we can hold off on an implementation until we get enough pressure to add it. Here are some reasons I don't like the idea of "sparse interpretation":

  1. If sparsity is important, then it seems better to fit a sparse model than to explain a dense one in a sparse way.
  2. Small SHAP values do not hurt in any of the standard visualizations: in waterfall/force plots, they can be collapsed into a variable "other"; in importance bar plots, they can be collapsed as well; and one can simply look at the most important dependence plots and ignore the others (a toy sketch follows below).
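As a toy illustration of the collapsing in point 2 (my own sketch, not package code; s is a named vector of SHAP values for one observation):

# Keep the k largest |SHAP| values and collapse the rest into "other"
collapse_shap <- function(s, k = 5L) {
  k <- min(k, length(s))
  ord <- order(abs(s), decreasing = TRUE)
  c(s[ord[seq_len(k)]], other = sum(s[ord[-seq_len(k)]]))
}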

As far as I can tell, KernelSHAP is basically identical to what you've implemented, minus the iterative component.
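Concretely, both reduce to the same weighted least squares under the efficiency constraint $\sum_j \beta_j = v_1 - v_0$, which has a closed form via a Lagrange multiplier. A minimal sketch in the paper's notation, assuming A = Z'WZ, b = Z'W(vz - v0), and v1, v0 the full-coalition and baseline values (names illustrative, not actual kernelshap internals):

# Constrained weighted least squares: minimize the weighted squared
# error subject to sum(beta) = v1 - v0 (efficiency)
Ainv <- solve(A)
lag <- (sum(Ainv %*% b) - (v1 - v0)) / sum(Ainv)  # scalar multiplier term
beta <- Ainv %*% (b - lag)                        # SHAP value estimates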

You raise some legitimate points re: sparse explanations. I still think there's no harm in having the option, though. It's basic functionality that a lot of users have come to expect from LIME and SHAP. Perhaps a word of caution in the details could address your concerns?

In any event, I've implemented a prototype on this fork. Results look OK in spot checks.

Sweet! Still, let's hold off on adding this to the main project. Not everything invented by data scientists is good; think about SMOTE...

I am currently working on the other open issue. It aims at better sampling (inspired by the Python implementation) while keeping the iterative logic. After two stupid attempts, I now have a very good one that minimizes inefficiency. The interface is not yet clear, but I will draft a version in the next 1-2 weeks. It is heavily based on your "exact" breakthrough.

Cool! Glad to hear you're making solid progress on the sampling issue. I agree the hybrid strategy definitely sounds like the way to go.

Reading a bit more of the SHAP documentation, it sounds like the L1 penalty is most effective when $m \ll 2^p$, as may well be the case when $p$ is large. Specifically, the default behavior is to use lasso when $m \leq 0.2 \times 2^p$ (e.g., for $p = 12$, whenever at most 819 of the 4096 possible $z$-vectors are evaluated). This not only produces a more "manageable" explanation, but should also cut down on variance, since we've only sampled a small portion of the possible $z$-vectors (though of course this comes at the cost of inducing some bias).

Thanks for the insights! The sweet thing about the upcoming version is that, thanks to the hybrid strategy, most of the mass of the $2^p$ vectors is covered by exact calculations. The rest is controlled by iterating until convergence. Thus, I don't think the $m \le 0.2 \cdot 2^p$ rule will be relevant for us.
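For intuition on why this works: the Shapley kernel gives coalition size $k$ a total weight proportional to $(p-1)/(k(p-k))$, which peaks at $k = 1$ and $k = p - 1$, so evaluating the smallest and largest coalitions exactly already covers most of the mass. A quick illustrative check (my own sketch, not kernelshap code):

# Share of Shapley kernel mass covered when all z-vectors of
# size k in {1, 2, p-2, p-1} are evaluated exactly (p = 10)
p <- 10
k <- 1:(p - 1)
mass <- (p - 1) / (k * (p - k))  # total kernel weight at each coalition size
sum(mass[k %in% c(1, 2, p - 2, p - 1)]) / sum(mass)  # ~0.61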

Will close this for the moment.