bgreenwell/fastshap

Sum of SHAP values not equal to `pred - mean(pred)` when `exact = TRUE`

dfsnow opened this issue · 2 comments

Hi! Thanks for the great package. I want to clarify a point of confusion I have before proceeding. I found the sample code you posted here and ran it locally. Quick reprex:

library(xgboost)
library(fastshap)
library(SHAPforxgboost)  # for the dataXY_df example data

y_var <- "diffcwv"
dataX <- as.matrix(dataXY_df[, -..y_var])

# hyperparameter tuning results
param_list <- list(objective = "reg:squarederror",  # For regression
                   eta = 0.02,
                   max_depth = 10,
                   gamma = 0.01,
                   subsample = 0.95
)
mod <- xgboost(data = dataX, label = as.matrix(dataXY_df[[y_var]]), 
               params = param_list, nrounds = 10, verbose = FALSE, 
               nthread = parallel::detectCores() - 2, early_stopping_rounds = 8)

# Grab SHAP values directly from XGBoost
shap <- predict(mod, newdata = dataX, predcontrib = TRUE)

# Compute Shapley values with fastshap::explain()
shap2 <- explain(mod, X = dataX, exact = TRUE, adjust = TRUE)

# Compute bias term; difference between predictions and sum of SHAP values
pred <- predict(mod, newdata = dataX)
head(bias <- pred - rowSums(shap2))
#> [1] 0.4174776 0.4174775 0.4174775 0.4174775 0.4174775 0.4174776

# Compare to output from XGBoost
head(shap[, "BIAS"])
#> [1] 0.4174775 0.4174775 0.4174775 0.4174775 0.4174775 0.4174775

# Check that SHAP values sum to the difference between pred and mean(pred)
head(cbind(rowSums(shap2), pred - mean(pred)))
#>             [,1]        [,2]
#> [1,] -0.03048085 -0.03053582
#> [2,] -0.08669319 -0.08674819
#> [3,] -0.05410853 -0.05416352
#> [4,] -0.09465271 -0.09470773
#> [5,] -0.01655553 -0.01661054
#> [6,] -0.01729831 -0.01735327

In this code, the sum of the SHAP values is not equal to the difference between pred and mean(pred) as suggested. Instead, the sum of the SHAP values is (nearly) equal to pred minus the BIAS term from the stats::predict(object, X, predcontrib = TRUE, ...) call in explain.xgb.Booster when exact = TRUE.

# Compare pred - BIAS from shap2
head(cbind(rowSums(shap2), pred - attributes(shap2)$baseline))
#>             [,1]        [,2]
#> [1,] -0.03048085 -0.03048083
#> [2,] -0.08669319 -0.08669320
#> [3,] -0.05410853 -0.05410853
#> [4,] -0.09465271 -0.09465274
#> [5,] -0.01655553 -0.01655555
#> [6,] -0.01729831 -0.01729828

So, quick questions:

  1. Should adjust = TRUE have the same effect on the exact = TRUE output as it does on the exact = FALSE output? In the call above (explain(mod, X = dataX, exact = TRUE, adjust = TRUE)), adjust = TRUE has no effect; it is simply passed on to the predict method for xgb.Booster and silently swallowed. Is this the intended behavior?
  2. Can you briefly explain the difference between the baseline/bias term (the last column of the matrix produced by predict(xgb.Booster, newdata = X, predcontrib = TRUE)) and the mean prediction, mean(pred)? I scoured the xgboost/lightgbm docs but couldn't find much. Both comparisons are sketched below.
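
For reference, here is roughly how I checked both points, reusing the objects from the reprex above (shap_noadjust is just a throwaway name I'm introducing here):

# 1. Does adjust change anything when exact = TRUE? Expect TRUE if it is ignored.
shap_noadjust <- explain(mod, X = dataX, exact = TRUE, adjust = FALSE)
all.equal(rowSums(shap2), rowSums(shap_noadjust))

# 2. Compare the BIAS column to the mean of the training predictions
c(BIAS = shap[1, "BIAS"], mean_pred = mean(pred))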

Hi @dfsnow, thanks for the note. Setting adjust = TRUE has no effect on the output when using exact = TRUE, since the exact SHAP values are already supposed to be additive. I'm not sure why the SHAP values aren't additive here (and I get the same issue when using XGBoost directly), so it may be better to ask on the XGBoost issues page. The bias column/term should be the average of all the training predictions (i.e., E(f(x))), which also corresponds to the difference between a particular prediction and the sum of its corresponding Shapley values.
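
In code, the relationship I would expect is roughly this (just a sketch using the objects from your reprex; recomposed is a name I'm introducing here):

# Each prediction should decompose into the bias plus that row's SHAP values
recomposed <- shap[, "BIAS"] + rowSums(shap[, colnames(dataX)])
all.equal(unname(recomposed), pred)  # should hold up to floating-point error

# ...and the bias should match the average training prediction, E(f(x)).
# This second comparison is the one that looks slightly off in your reprex.
all.equal(unname(shap[1, "BIAS"]), mean(pred))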

Interesting. For what it's worth, this issue also occurs with LightGBM. I'll open a quick issue on the xgboost repo. Thanks!