~4000 observations -- tibble giving one row of outputs
jg3405 opened this issue · 14 comments
Hi, I'm trying to generate the Shapley values for a dataset with 14 features and around 4000 observations (a data frame with 14 columns and 4000 rows). I create a model using ranger, but when I go to generate my Shapley values I only receive a one-row output. I get different values depending on how many observations I include, but still just a one-row output. Please help.
Hi @jg3405, in order to help, I’d need to see a reproducible example. Can you try posting one here?
Hi Brian,
Here is an example of a RF model I generated using the ranger package in R.
The data used for the model is also included. As you can see, the response variable is a factor (only 0s and 1s). I would like to get Shapley values for each observation, but when using the code:
pfun <- function(object, newdata, type) predict(object, data = newdata, type = "response")$predictions
shap_new <- explain(model_prob, X = all_data_no_nan[,-1], nsim = 10, pred_wrapper=pfun)
I only get one row of data as output as though I had one observation. Thanks for trying to help.
Julia
Thanks for the example code @jg3405. The issue is that your prediction wrapper is returning probability estimates for both classes, but explain()
can only work with one predicted output. All you need to do is specify which class probability you want to explain:
pfun <- function(object, newdata, type) {
# Return probability estimate for Y = 1 (i.e., the second column)
predict(object, data = newdata, type = "response")$predictions[, 2L]
}
shap_new <- explain(model_prob, X = all_data_no_nan[,-1], nsim = 10, pred_wrapper = pfun, .progress = "text")
I'll try to add an informative error message for this in the next release!
Amazing! Thank you for getting back to me so fast! It works like a charm :)
The same error occurs when using glmnet, where the prediction function needs to be function(mdl, newdata) as.numeric(predict(mdl, newdata, s = best_lambda)) (notice the as.numeric(); otherwise predict() would return a numeric matrix rather than a vector, resulting in the single-row output).
Btw. thanks for creating such a wonderful package, greatly appreciated!
Full MWE below:
library(fastshap)
library(AmesHousing)
library(glmnet)
data <- AmesHousing::make_ames()
dim(data)
#> [1] 2930 81
y <- data$Sale_Price
X <- as.matrix(data[, c("Lot_Frontage", "Lot_Area", "Year_Sold", "Year_Built")])
glm <- cv.glmnet(X, y)
expl_glm <- fastshap::explain(
glm, X = X,
pred_wrapper = function(mdl, newdata) as.numeric(predict(mdl, newdata, s = glm$lambda.min))
)
sapply(expl_glm, function(x) mean(abs(x)))
#> Lot_Frontage Lot_Area Year_Sold Year_Built
#> 81.60870 31.03433 3.05142 908.30222
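Continuing the MWE above, a quick way to see why the as.numeric() coercion matters (this reuses the glm and X objects defined above; object names are from the MWE, comments are mine):

```r
# glmnet's predict() returns a one-column matrix, not a vector
p_mat <- predict(glm, X, s = glm$lambda.min)
class(p_mat)  # a matrix, which trips up explain()
dim(p_mat)    # n rows, 1 column

# Coercing drops the matrix class and yields the plain numeric
# vector that pred_wrapper is expected to return
p_vec <- as.numeric(p_mat)
is.vector(p_vec)
```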
Question related to the above
I am trying to use fastshap after a tidymodels workflow with xgboost for binary classification. My confusion is about what fastshap uses as the "event" or "positive" class (with or without exact = FALSE). In yardstick, the default is the first level; I suspect the default is the second level in xgboost/fastshap? The suggestion above for the prediction wrapper does not work to select the event of interest, because predict() for an xgboost model gives only one set of predictions.
pfun <- function(object, newdata, type) {
# Return probability estimate for Y = 1 (i.e., the second column)
predict(object, data = newdata, type = "response")$predictions[, 2L]
}
shap_new <- explain(model_prob, X = all_data_no_nan[,-1], nsim = 10, pred_wrapper = pfun, .progress = "text")
This does not work.
My code is passing the xgbooster model and a matrix of data to fastshap::explain
final_fit <- finalize_workflow(wflw_spec_xgb, select_best(tune_results_xgb)) %>%
fit(dfFinal)
final_model <- pull_workflow_fit(final_fit)
X <- prep(xgboost_recipe, dfFinal) %>%
juice() %>%
select(-Class_Pr) %>%
as.matrix()
shap <- fastshap::explain(final_model$fit, X = X,
exact = TRUE)
Hi @trilliumtechnical, it is difficult to provide a good answer without a reproducible example, but in general, the reference class for the Shapley values is dictated by the probability estimates output by the prediction wrapper (as supplied to pred_wrapper). In your example above, it would be whichever class is stored in the second column, which you extract in your definition of pfun(). In XGBoost, binary outcomes have to be encoded as 0/1, so the reference class is always whichever class is coded as 1. I'm not sure what tidymodels does during the fitting process with XGBoost, but eventually the outcome has to be encoded as 0/1. Does that help answer your question?
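As a quick check of the 0/1 encoding point above, here is a minimal sketch using xgboost directly on made-up data (the data, labels, and parameter values are illustrative, not from the thread):

```r
library(xgboost)

set.seed(101)
X <- matrix(rnorm(200), ncol = 2)
y <- as.integer(X[, 1] + rnorm(100) > 0)  # labels must be coded 0/1

bst <- xgboost(data = X, label = y, nrounds = 10,
               objective = "binary:logistic", verbose = 0)

# For binary:logistic, predict() returns a single numeric vector of
# Pr(y = 1), i.e., probabilities for whichever class was coded as 1;
# there is no second column to select
p <- predict(bst, X)
length(p)  # one probability per row
```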
@bgreenwell
Thank you for the fast response. I think I understand. I am providing what I think is a simple, reproducible example, perhaps to help others, that verifies xgboost uses whichever column is coded as 1, although the predict function returns only one column. Note: I excluded all the tuning (the reason for the workflow) to keep the example simple.
library(tidyverse)
library(tidymodels)
library(xgboost)
library(fastshap)
dfFinal <- read_rds(file.path(here::here(), 'dfFinal.RDS'))
# Create a recipe
xgboost_recipe <- recipe(formula = Class_Pr ~ ., data = dfFinal) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_novel(all_nominal(), -all_outcomes(), -has_role('lbl')) %>%
step_dummy(all_nominal(), -all_outcomes(), -has_role('lbl'), one_hot = TRUE)
# Create a workflow
wflw_spec_xgb <- workflow() %>%
add_model(
spec = boost_tree(
mode = "classification",
min_n = 2,
learn_rate = 5,
trees = 100
) %>%
set_engine("xgboost", objective = 'binary:logistic')
) %>%
add_recipe(xgboost_recipe)
# Fit the original model
final_fit <- wflw_spec_xgb %>% fit(dfFinal)
# Create a matrix from the data
X <- prep(xgboost_recipe, dfFinal) %>%
juice() %>%
select(-Class_Pr) %>%
as.matrix()
# Extract the xgbooster model
final_model <- pull_workflow_fit(final_fit)
# final_model$fit is a xgbooster model
# this predict statement returns a vector of probabilities
# This is the probabilities for the second class of 1
predict(final_model$fit, X)
predict(final_fit, dfFinal, type = 'prob')
pfun <- function(object, newdata) {
y <- predict(object, newdata = newdata)
}
shap <- fastshap::explain(final_model$fit, X = X, pred_wrapper = pfun, nsim = 10)
shap %>% autoplot()
Follow-up question: is it possible to write a prediction function for fastshap that uses a workflow? My attempt failed, possibly due to the problem with tibbles? #20
Perhaps this example will help with the tibble issue?
# Try to write a wrapper function that uses the workflow
pfunWF <- function(object, newdata) {
y <- object %>% predict(newdata, type = 'prob') %>% pull(., 2)
}
# Does prediction function work?
yyy <- pfunWF(final_fitWF, dfFinal %>% select(-Class_Pr))
# final_fit is a workflow
predict(final_fit, dfFinal, type = 'prob')
shap <- fastshap::explain(final_fit, X = dfFinal, pred_wrapper = pfunWF, nsim = 10)
#Error: Can't combine `Variety` <factor<ab9b3>> and `Plant.Day` <double>.
# Tried to correct the problem by changing everything to numeric in the recipe
xgboost_recipe <- recipe(formula = Class_Pr ~ ., data = dfFinal) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_novel(all_nominal(), -all_outcomes(), -has_role('lbl')) %>%
step_dummy(all_nominal(), -all_outcomes(), -has_role('lbl'), one_hot = TRUE) %>%
step_mutate_at(all_nominal(), -all_outcomes(), fn = as.numeric)
# Check the result: Are all features numeric
dfCheck <- bake(xgboost_recipe %>% prep(), new_data = NULL)
glimpse(dfCheck)
# Fit the original model with new recipe
final_fitWF <- wflw_spec_xgb %>% fit(dfFinal)
predict(final_fitWF, dfFinal, type = 'prob')
shap <- fastshap::explain(final_fitWF, X = dfFinal %>% select(-Class_Pr), pred_wrapper = pfunWF, nsim = 10)
# Still get the error.
@trilliumtechnical thanks for including an example. I see two issues with your code:
1. pfunWF() does not return anything (so simply omit the y <- part).
2. Coerce the data to a data frame when passing into explain(), e.g., just use X = as.data.frame(dfFinal) and it will work (at least it did for me).
Adding tibble support isn't out of the question, but I'm not sure how difficult that change would be to implement (fastshap uses logical subsetting and some other operations that may not be allowed on tibbles). For now, it's easy enough to drop the tibble class before passing it in. Also, XGBoost has "exact" Shapley values built in (so-called TreeSHAP), which can be obtained even faster by setting exact = TRUE in explain() (which just calls predict(xgb_object, newdata = X, predcontrib = TRUE) under the hood), but this will only work if given the XGBoost object directly. Although, I'm sure it's possible to handle workflow objects too if it allows you to call predict(xgb_object, newdata = X, predcontrib = TRUE). Maybe @topepo can chime in on that part.
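To make the exact = TRUE shortcut concrete, here is a minimal sketch on made-up data (the model and data are illustrative; only the explain() and predict() calls mirror the thread):

```r
library(xgboost)
library(fastshap)

set.seed(102)
X <- matrix(rnorm(300), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- as.integer(X[, 1] - X[, 2] + rnorm(100) > 0)
bst <- xgboost(data = X, label = y, nrounds = 25,
               objective = "binary:logistic", verbose = 0)

# Exact TreeSHAP: no pred_wrapper or nsim needed, and no Monte Carlo error
shap_exact <- fastshap::explain(bst, X = X, exact = TRUE)

# Roughly the call made under the hood; the extra last column holds
# the baseline (bias) term
contrib <- predict(bst, newdata = X, predcontrib = TRUE)
```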
Hello everyone, I am also experiencing difficulties implementing the explain() function with a fitted xgboost model from the mlr package. The error reads: Error in UseMethod("explain") :
no applicable method for 'explain' applied to an object of class "WrappedModel"
Is it not possible to apply SHAP values to a fitted xgboost model from the mlr package?
For a simplified version of the fitted model, the code looks as follows:
train_df <- as.data.frame(train_df)
test_df <- as.data.frame(test_df)
traintask <- makeClassifTask(data = train_df, target = "SeriousDlqin2yrs")
testtask <- makeClassifTask(data = test_df, target = "SeriousDlqin2yrs")
xgb_learner <- makeLearner(
"classif.xgboost",
predict.type = "prob",
par.vals = list(
objective = "binary:logistic",
eval_metric = "error",
nrounds = 200
)
)
xgb_model <- mlr::train(xgb_learner, task = traintask)
####### SHAP (fastshap package) #######
pfunWF <- function(object, newdata) {
predict(xgb_model, newdata)$data$prob.1
}
shap <- explain(xgb_model,
X = train_df,
nsim = 1,
pred_wrapper = pfunWF)
Any help on how to fix this error would be much appreciated!
Hi @Thomassimancik, not sure why you'd be getting this error message since explain()
is a generic and should work as long as you pass in a prediction wrapper. If you post a small reproducible example I can copy and run on my end, I'd be happy to take a closer look.
Hi @bgreenwell, thanks for your swift reply! I am sending a small reproducible example below:
data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
mtcars <- as.data.frame(mtcars)
traintask <- makeClassifTask(data = mtcars, target = "vs")
xgb_learner <- makeLearner(
"classif.xgboost",
predict.type = "prob",
par.vals = list(
objective = "binary:logistic",
eval_metric = "error",
nrounds = 50
)
)
model <- mlr::train(xgb_learner, task = traintask)
pfunWF <-function(object, newdata) {
predict(object, newdata, type = 'prob')$data$prob.1
}
shap1 <- fastshap::explain(model,
X = mtcars[, -9],
nsim = 1,
pred_wrapper = pfunWF,
.progress = "text")
I suspect that it has something to do with the definition of traintask object with the makeClassifTask function.
Two issues in the above code:
1. newdata is not the second positional argument in mlr's prediction method, so you'll need to name it.
2. You omitted the wrong column number in X when calling explain(), so the feature names won't match and you'll get another error.
Fixes in the example below.
data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
mtcars <- as.data.frame(mtcars)
traintask <- makeClassifTask(data = mtcars, target = "vs")
xgb_learner <- makeLearner(
"classif.xgboost",
predict.type = "prob",
par.vals = list(
objective = "binary:logistic",
eval_metric = "error",
nrounds = 50
)
)
model <- mlr::train(xgb_learner, task = traintask)
pfunWF <- function(object, newdata) {
# predict(object, newdata, type = 'prob')$data$prob.1
predict(object, newdata = newdata, type = 'prob')$data$prob.1
}
shap1 <- fastshap::explain(model,
#X = mtcars[, -9],
X = mtcars[, -8],
nsim = 1,
pred_wrapper = pfunWF)