Inferring feature importance in high-dimensional data

Primary language: R

subSAGE

subSAGE is a Shapley-value-based framework for inferring feature importance in high-dimensional data. It is based on SAGE (Shapley Additive Global importancE), adjusted for high-dimensional data. We also demonstrate how to perform paired bootstrapping to estimate confidence intervals. In particular, we investigate subSAGE applied to tree ensemble models. We emphasize the importance of computing subSAGE on independent test data that was not used to train the model.

Preprint

Preprint is available here.

Usage

Given an xgboost model, test data, and a particular feature, the subSAGE estimate can be computed in R as:

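library(xgboost)     # provides xgb.model.dt.tree
library(data.table)  # provides as.data.table
source("~/subSAGE/subSAGE.R")  # adjust the path to your local copy of the repository
# Tree structure of the fitted model in tabular form
t = xgb.model.dt.tree(model = model)
trees = as.data.table(xgboost.trees(xgb_model = model, data = data, recalculate = FALSE))
# subSAGE estimate for `feature`, evaluated on the test data `data`
estimate = subSage_cpp(data, trees, feature, loss = "RMSE")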
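
The paired bootstrapping mentioned in the introduction can be illustrated with a minimal base-R sketch. It is not taken from the repository: `paired_bootstrap_ci`, `statistic`, `B`, and `level` are hypothetical names, "paired" is taken here to mean that whole test rows are resampled so each feature vector stays with its response, and a simple percentile interval is assumed. The procedure in the preprint may differ in its details.

```r
# Hypothetical helper (not part of subSAGE): percentile bootstrap confidence
# interval for any scalar statistic of the test data, resampling whole rows.
paired_bootstrap_ci <- function(data, statistic, B = 1000, level = 0.95) {
  n <- nrow(data)
  estimates <- numeric(B)
  for (b in seq_len(B)) {
    # Draw n row indices with replacement; features and response stay paired
    idx <- sample.int(n, size = n, replace = TRUE)
    estimates[b] <- statistic(data[idx, , drop = FALSE])
  }
  alpha <- 1 - level
  ci <- quantile(estimates, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  list(estimate = statistic(data), lower = ci[1], upper = ci[2])
}

# Toy illustration with a stand-in statistic (assumed, not subSAGE itself):
# reduction in mean squared error from using x in a fixed linear predictor,
# relative to predicting the mean of y.
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 2 * toy$x + rnorm(200)
loss_reduction <- function(d) {
  mean((d$y - mean(d$y))^2) - mean((d$y - 2 * d$x)^2)
}
result <- paired_bootstrap_ci(toy, loss_reduction, B = 500)
print(result)
```

To use this with subSAGE, `statistic` would wrap the call to `subSage_cpp` on the resampled test rows, assuming that call returns a single number. Whether `trees` must be recomputed for each resample depends on the implementation and is not covered by this sketch.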