Approximate method with glm model returns zeros.
esther-meerwijk opened this issue · 3 comments
I've been perusing various sites that describe how to determine approximate values with fastshap for a binomial glm model, but so far have been unsuccessful in making it work. Here's what I have been using:
x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y <- c(1,0,1,0,1,1,0,0,0,1)
df <- data.frame(x1, x2, x3, x4, y)
fit <- glm(y ~ ., data=df, family=binomial)
X <- model.matrix(y ~., df)[,-1]
pfun <- function(object, newdata) {
predict(object, type="response")
}
shap <- explain(fit , X = X, pred_wrapper = pfun, nsim = 10)
Here's the result:
> summary(shap)
x1 x2 x3 x4
Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0
Obviously not what I expect. With the exact method, I do get values that make sense:
shap <- explain(fit , X = X, exact=TRUE, nsim = 10)
summary(shap)
x1 x2 x3 x4
Min. :-0.3659 Min. :-0.8149 Min. :-0.62699 Min. :-1.0497
1st Qu.:-0.3659 1st Qu.:-0.8149 1st Qu.:-0.62699 1st Qu.:-1.0497
Median :-0.3659 Median :-0.8149 Median : 0.06967 Median : 0.6998
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.5489 3rd Qu.: 1.2223 3rd Qu.: 0.59215 3rd Qu.: 0.6998
Max. : 0.8538 Max. : 1.2223 Max. : 0.76632 Max. : 0.6998
but I cannot use the exact method on my actual data because the model features are not independent. Any help getting this to work would be appreciated!
Hi @esther-meerwijk, I just ran your code and I get the same results…very strange. I’m on vacation but will try to figure out what’s going on later this week.
Hi @esther-meerwijk, a couple of small tweaks to fix your script:
- you forgot to pass newdata to predict() in your definition of pfun();
- X, in this case, needs to be a data frame (because GLMs can only predict on data frames);
- for consistency between the results from the exact method (which are based on the coefficients and are on the link scale) and the approximate method, you should use type = 'link' instead.
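To see why the original wrapper produced all zeros, here is a minimal sketch using only base R. It assumes the approximate method works by comparing predictions on perturbed copies of X. The names pfun_bad, pfun_ok, and X_perm are made up for illustration and are not part of fastshap.

```r
# Same toy data as in the original post
x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y  <- c(1,0,1,0,1,1,0,0,0,1)
df <- data.frame(x1, x2, x3, x4, y)
X  <- subset(df, select = -y)  # features only, as a data frame
fit <- glm(y ~ ., data = df, family = binomial)

# Original wrapper: newdata is accepted but never forwarded to predict(),
# so it always returns the fitted values for the training data
pfun_bad <- function(object, newdata) {
  predict(object, type = "response")
}

# Corrected wrapper: newdata is forwarded to predict()
pfun_ok <- function(object, newdata) {
  predict(object, newdata = newdata, type = "link")
}

# Perturb a single feature (hypothetical stand-in for the shuffling
# that a Monte Carlo approximation would do)
X_perm <- X
X_perm$x3 <- rev(X_perm$x3)

# Does the wrapper output stay the same after the perturbation?
bad_unchanged <- isTRUE(all.equal(pfun_bad(fit, X), pfun_bad(fit, X_perm)))
ok_changed    <- !isTRUE(all.equal(pfun_ok(fit, X), pfun_ok(fit, X_perm)))
print(c(bad_unchanged = bad_unchanged, ok_changed = ok_changed))
```

If the wrapper's output never changes when a feature changes, every difference in predictions is zero, which would explain the all-zero summary above.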
Code and output below:
x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y <- c(1,0,1,0,1,1,0,0,0,1)
df <- data.frame(x1, x2, x3, x4, y)
X <- subset(df, select = -y) # features only
fit <- glm(y ~ ., data=df, family=binomial)
pfun <- function(object, newdata) {
predict(object, type = "link", newdata = newdata)
}
set.seed(845) # for reproducibility
head(shap1 <- explain(fit , X = X, pred_wrapper = pfun, nsim = 1000))
# A tibble: 6 × 4
# x1 x2 x3 x4
# <dbl> <dbl> <dbl> <dbl>
# 1 0.853 1.22 -0.639 0.723
# 2 0.848 -0.807 0.0390 -1.03
# 3 0.868 -0.854 0.748 0.696
# 4 -0.379 1.24 -0.601 0.682
# 5 -0.392 1.17 0.0620 -1.01
# 6 -0.381 1.24 0.777 0.693
head(shap2 <- explain(fit , X = X, exact = TRUE))
# A tibble: 6 × 4
# x1 x2 x3 x4
# <dbl> <dbl> <dbl> <dbl>
# 1 0.854 1.22 -0.627 0.700
# 2 0.854 -0.815 0.0697 -1.05
# 3 0.854 -0.815 0.766 0.700
# 4 -0.366 1.22 -0.627 0.700
# 5 -0.366 1.22 0.0697 -1.05
# 6 -0.366 1.22 0.766 0.700
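The exact values above look like centered per-term contributions on the link scale, i.e. coefficient times (feature value minus feature mean). The sketch below checks that relationship with base R only. Whether fastshap computes its exact values this way internally is an assumption here, and the names terms_contrib and manual are illustrative.

```r
# Same toy data and model as above
x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y  <- c(1,0,1,0,1,1,0,0,0,1)
df <- data.frame(x1, x2, x3, x4, y)
X  <- subset(df, select = -y)
fit <- glm(y ~ ., data = df, family = binomial)

# Centered per-feature contributions on the link scale (base R)
terms_contrib <- predict(fit, type = "terms")

# The same quantity by hand: beta_j * (x_ij - mean(x_j))
beta     <- coef(fit)[-1]                            # drop the intercept
centered <- sweep(as.matrix(X), 2, colMeans(X))      # subtract column means
manual   <- sweep(centered, 2, beta, "*")            # scale by coefficients

# Largest absolute difference between the two versions
max_diff <- max(abs(terms_contrib - manual))
print(max_diff)

# Contributions plus the constant recover the linear predictor
recon_diff <- max(abs(rowSums(terms_contrib) + attr(terms_contrib, "constant") -
                      predict(fit, type = "link")))
print(recon_diff)

head(round(manual, 3))
```

Because these contributions live on the link scale, a wrapper using type = "response" would give approximate values on a different scale, which is why type = "link" is needed for the two methods to agree.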
Yep, that does it 👍 Thanks so much for figuring that out!