Can we use fastshap to explain isolation forest in R?
oabhitej opened this issue · 6 comments
Currently, fastshap is designed only for supervised learning, per the package description. Can we use this package to explain unsupervised learning algorithms like isolation forest? I see we can already use isotree with the shapper package, and having similar functionality in fastshap would be amazing.
@oabhitej Yes, you can. You can technically use fastshap (or any other package that supports Shapley-like explanations, like iml or iBreakDown) to explain any model that can produce scores/predictions for new data. While I haven't seen it used in this context, I think it makes perfect sense. Here's an example, from an upcoming book I'm writing on tree-based methods for CRC Press, that uses fastshap to explain observations with high anomaly scores from an isolation forest fit to a well-known credit card fraud data set:
library(ggplot2) # needed for the plot at the end
library(isotree)
ccfraud <- data.table::fread("../data/ccfraud.csv") # https://www.kaggle.com/mlg-ulb/creditcardfraud
# Randomize the data
set.seed(2117) # for reproducibility
ccfraud <- ccfraud[sample(nrow(ccfraud)), ]
# Split data into train/test sets
set.seed(2013) # for reproducibility
trn.id <- sample(nrow(ccfraud), size = 10000, replace = FALSE)
ccfraud.trn <- ccfraud[trn.id, ]
ccfraud.tst <- ccfraud[-trn.id, ]
# Fit a default isolation forest
ifo <- isolation.forest(ccfraud.trn[, 1L:30L], random_seed = 2223, nthreads = 1)
# Compute anomaly scores for the test observations
head(scores <- predict(ifo, newdata = ccfraud.tst))
# Training set anomaly scores
scores.trn <- predict(ifo, newdata = ccfraud.trn)
to.explain <- max(scores) - mean(scores.trn) # f(x) - baseline: the difference to explain
max.id <- which.max(scores) # row ID for observation with the highest anomaly score
max.x <- ccfraud.tst[max.id, ]
max(scores)
max.x # observation to "explain" or compute feature contributions for
X <- ccfraud.trn[, 1L:30L] # feature columns only
max.x <- max.x[, 1L:30L] # feature columns only!
pfun <- function(object, newdata) { # prediction wrapper
  predict(object, newdata = newdata)
}
# Generate feature contributions
set.seed(1351) # for reproducibility
(ex <- fastshap::explain(ifo, X = X, newdata = max.x, pred_wrapper = pfun,
                         adjust = TRUE, nsim = 1000))
sum(ex) # should sum to f(x) - baseline (compare with `to.explain`) whenever `adjust = TRUE`
# Transpose feature contributions
res <- data.frame(
  # colnames() works whether `ex` is returned as a tibble or as a matrix
  "feature" = paste0(colnames(ex), "=", round(max.x, digits = 2)),
  "shapley.value" = as.numeric(as.vector(ex[1L, ]))
)
# Plot feature contributions
ggplot(res, aes(x = shapley.value, y = reorder(feature, shapley.value))) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  xlab("Shapley value") +
  ylab("") +
  theme(axis.text.y = element_text(size = rel(0.8)))
The interpretation of the output here is a bit moot since the feature names have been anonymized, but it illustrates the idea that feature contributions can be useful in explaining anomaly scores.
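To make the "any model that can produce scores" point concrete, here is a minimal base-R sketch of the kind of Monte Carlo Shapley estimate that fastshap approximates. Everything in it is an assumption for illustration: `score()` is a toy stand-in for an anomaly scorer (not isotree's `predict()`), `shapley_mc()` is a hypothetical helper, and the data are simulated. It is not fastshap's actual implementation.

```r
# Toy stand-in for an anomaly scorer (hypothetical; NOT isotree's predict()):
# squared distance from the origin, one score per row of a numeric matrix
score <- function(m) rowSums(m^2)

# Monte Carlo estimate of the Shapley value of feature `j` for observation `x`
shapley_mc <- function(score, X, x, j, nsim = 2000) {
  contrib <- numeric(nsim)
  for (i in seq_len(nsim)) {
    w <- X[sample(nrow(X), 1), ]               # random background row
    ord <- sample(ncol(X))                     # random feature ordering
    before <- ord[seq_len(match(j, ord) - 1)]  # features preceding j
    with.j <- w
    with.j[c(before, j)] <- x[c(before, j)]    # x for `before` and j, w elsewhere
    without.j <- w
    without.j[before] <- x[before]             # same, but j keeps its background value
    s <- score(rbind(with.j, without.j))
    contrib[i] <- s[1] - s[2]                  # marginal effect of switching j to x[j]
  }
  mean(contrib)
}

set.seed(101) # for reproducibility
X <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
x <- c(x1 = 3, x2 = 0, x3 = -2) # observation to "explain"

phi <- sapply(seq_len(ncol(X)), function(j) shapley_mc(score, X, x, j))
names(phi) <- colnames(X)
phi
sum(phi)                         # roughly f(x) - baseline, up to Monte Carlo error
score(rbind(x)) - mean(score(X)) # f(x) - baseline
```

The only thing the estimator needs from the model is a function that returns one score per row, which is the role `pfun` plays in the isolation forest example. Note that this raw estimate only approximately sums to f(x) - baseline; the `adjust = TRUE` option used above is what makes the sum match.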
Thank you, @bgreenwell. I know it is a lot to ask, but do you also plan to add a TreeSHAP implementation to the fastshap package in a future release?
A generic implementation is not on the roadmap, but fastshap does support TreeSHAP for xgboost and lightgbm models.
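For reference, a minimal sketch of the existing xgboost support, assuming the xgboost and fastshap packages are installed. The `mtcars` data and the parameter values are placeholders chosen for illustration (a supervised model, not an isolation forest); the relevant piece is `exact = TRUE`, which asks fastshap for TreeSHAP values instead of the Monte Carlo approximation.

```r
library(fastshap)
library(xgboost)

# Illustrative data: predict mpg from the remaining mtcars columns
X <- data.matrix(mtcars[, -1])
y <- mtcars$mpg

# Small boosted model; parameter values are arbitrary placeholders
dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(
  params = list(objective = "reg:squarederror", max_depth = 2),
  data = dtrain,
  nrounds = 25
)

# exact = TRUE requests TreeSHAP values, so no pred_wrapper or nsim is needed
ex <- fastshap::explain(bst, X = X, exact = TRUE)
dim(ex) # one row per observation, one column per feature
```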
I suspect you can use TreeSHAP with sklearn’s isolation forest. Wouldn’t be hard to wrap all of that in R using reticulate.
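A rough sketch of what that wrapper could look like, assuming reticulate is configured with a Python environment that has scikit-learn and shap installed. The simulated data and settings are placeholders, and I have not checked this against every version of either library, so treat it as a starting point.

```r
library(reticulate)

sk_ensemble <- import("sklearn.ensemble")
shap <- import("shap")

# Simulated data with one obvious outlier in the first row (illustration only)
set.seed(101)
X <- matrix(rnorm(500 * 4), ncol = 4)
X[1, ] <- c(6, -6, 6, -6)

# Fit sklearn's isolation forest; integer arguments need the L suffix
iso <- sk_ensemble$IsolationForest(n_estimators = 100L, random_state = 0L)
invisible(iso$fit(X))

# score_samples() is lower for more abnormal rows, so negate it
scores <- -iso$score_samples(X)
which.max(scores) # row with the highest anomaly score

# TreeSHAP feature contributions for the outlying row
explainer <- shap$TreeExplainer(iso)
sv <- explainer$shap_values(X[1, , drop = FALSE])
sv # 1 x 4 matrix: one contribution per feature
```

As far as I understand, shap explains the forest's path-length-based output here rather than the negated score computed above, so check the sign convention in the shap documentation before interpreting the contributions.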