partial() function giving inconsistent results when feeding multiple pred.var variables
rahualram opened this issue · 1 comments
Hello,
I've been running into an issue with the partial() function when using it across multiple features with a custom grid. partial() works well with a single row in the training dataset but doesn't seem to give sensible results when you add more rows and vary multiple features. Here is an example:
# Load required packages
library(pdp)
library(xgboost)
library(data.table)  # data.table(), as.data.table(), `:=`
library(dplyr)       # %>% and rename()
# Load the data
data(pima)
X <- subset(pima, select = -diabetes)
y <- ifelse(pima$diabetes == "pos", 1, 0)
# Parameters for XGBoost model
param.list <- list(max_depth = 5, eta = 0.01, objective = "binary:logistic",
                   eval_metric = "auc")
# Fit an XGBoost model
set.seed(101)
pima.xgb <- xgb.train(params = param.list,
                      data = xgb.DMatrix(data.matrix(X), label = y),
                      nrounds = 500)
grid <- data.table(mass = c(20,40), age = c(30, 50))
all_predict <- cbind(X, prediction = predict(pima.xgb, xgb.DMatrix(data.matrix(X), label = y)))
person_1 <- partial(pima.xgb, pred.var = c("mass", "age"), train = X[1,], pred.grid = grid)
person_2 <- partial(pima.xgb, pred.var = c("mass", "age"), train = X[2,], pred.grid = grid)
both_ppl <- partial(pima.xgb, pred.var = c("mass", "age"), train = X[c(1,2),], pred.grid = grid)
results <- merge(person_1, person_2, c("mass", "age"))
results <- as.data.table(merge(results, both_ppl, c("mass", "age")))
results <- results %>% rename("person1" = "yhat.x", "person2" = "yhat.y", "both" = "yhat")
results[,avg := (person1+person2)/2]
View(results)
The above shows that when using partial() with more than one row, the result deviates from the average of the per-row results. I've had a look into what is causing this. At first I thought it was due to the differences between xgb.DMatrix and data.matrix, but I ended up replicating the error by pulling apart the pardep function a bit. The following line of code in pardep seems to be causing the issue when pred.var contains more than one feature:
temp[, pred.var] <- pred.grid[i, pred.var]
It seems that pred.grid is converted with data.matrix and then subset to a single row, which turns it into a plain vector; that vector is then used to populate temp by column rather than by row. Using the example above, my dataset is meant to look like
        Mass  Age
Row 1:    20   30
Row 2:    20   30
when running the first grid point (i.e. i = 1 in the foreach loop).
But what I'm seeing when stepping through is:
        Mass  Age
Row 1:    20   20
Row 2:    30   30
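The pattern above can be reproduced in base R, without pdp or xgboost. The sketch below assumes temp and pred.grid are both plain numeric matrices at the point of assignment (an assumption about pardep's internals; the zero-filled temp is a placeholder, not real data). Subsetting a matrix to one row drops it to a vector, and assigning a vector into several matrix columns fills them column by column.

```r
pred.var  <- c("mass", "age")
pred.grid <- data.matrix(data.frame(mass = c(20, 40), age = c(30, 50)))

# Placeholder for two training rows restricted to the two features
temp <- matrix(0, nrow = 2, ncol = 2, dimnames = list(NULL, pred.var))

# Same shape of assignment as the line quoted from pardep:
# pred.grid[1, pred.var] is the plain vector c(mass = 20, age = 30),
# which is recycled down the columns of temp
buggy <- temp
buggy[, pred.var] <- pred.grid[1, pred.var]
print(buggy)

# One possible row-safe alternative: fill each column separately
fixed <- temp
for (v in pred.var) fixed[, v] <- pred.grid[1, v]
print(fixed)
```

With a single training row the two assignments give the same result, which would be consistent with the problem only appearing once train has more than one row.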
Is this a known issue, or am I doing something wrong when feeding in the parameters to partial()?
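For reference, the values I would expect can be computed by hand: the partial dependence at a grid point is the average prediction over the training rows after overwriting the pred.var columns with that grid point. The sketch below is a brute-force version of that definition. manual_pd is a hypothetical helper (not part of pdp), and an lm fit on mtcars stands in for the XGBoost model so that it runs with base R only.

```r
# Hypothetical helper, not pdp code: brute-force partial dependence
manual_pd <- function(model, train, grid) {
  sapply(seq_len(nrow(grid)), function(i) {
    temp <- train
    # Overwrite one column at a time so every row gets the same grid value
    for (v in names(grid)) temp[[v]] <- grid[[v]][i]
    mean(predict(model, newdata = temp))
  })
}

# Stand-in model; qsec is left untouched so predictions differ between rows
fit  <- lm(mpg ~ wt * qsec + hp, data = mtcars)
grid <- data.frame(wt = c(2, 4), hp = c(100, 200))

pd_1    <- manual_pd(fit, mtcars[1, ], grid)
pd_2    <- manual_pd(fit, mtcars[2, ], grid)
pd_both <- manual_pd(fit, mtcars[1:2, ], grid)

# The two-row result should equal the average of the single-row results
print(all.equal(pd_both, (pd_1 + pd_2) / 2))
```

This is the relationship I was checking with the avg column in the results table above.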
Hi @rahualram, thanks for the note. I'll look into this shortly!