bgreenwell/pdp

partial() function giving inconsistent results when feeding multiple pred.var variables

rahualram opened this issue · 1 comments

Hello,

I've run into an issue with partial() when passing multiple features in pred.var together with a custom pred.grid. It works as expected when train contains a single row, but the results stop making sense once train has more than one row and the grid varies more than one feature. Here is an example:

# Load required packages
library(pdp)
library(xgboost)
library(data.table)  # data.table(), as.data.table(), :=
library(dplyr)       # %>% and rename()

# Load the data
data(pima)
X <- subset(pima, select = -diabetes)
y <- ifelse(pima$diabetes == "pos", 1, 0)

# Parameters for XGBoost model
param.list <- list(max_depth = 5, eta = 0.01, objective = "binary:logistic", 
                   eval_metric = "auc")

# Fit an XGBoost model
set.seed(101)
pima.xgb <- xgb.train(params = param.list, 
                      data = xgb.DMatrix(data.matrix(X), label = y), 
                      nrounds = 500)

grid <- data.table(mass = c(20,40), age = c(30, 50))

all_predict <- cbind(X, prediction = predict(pima.xgb, xgb.DMatrix(data.matrix(X), label = y)))

person_1 <- partial(pima.xgb, pred.var = c("mass", "age"), train = X[1,], pred.grid = grid)

person_2 <- partial(pima.xgb, pred.var = c("mass", "age"), train = X[2,], pred.grid = grid)

both_ppl <- partial(pima.xgb, pred.var = c("mass", "age"), train = X[c(1,2),], pred.grid = grid)

results <- merge(person_1, person_2, c("mass", "age"))
results <- as.data.table(merge(results, both_ppl, c("mass", "age")))
results <- results %>% rename("person1" = "yhat.x", "person2" = "yhat.y", "both" = "yhat")
results[,avg := (person1+person2)/2]

View(results)

The results show that when train has more than one row, the value returned by partial() differs from the average of the per-row results. I looked into the cause. At first I thought it came from differences between xgb.DMatrix and data.matrix, but I was able to reproduce the error by pulling apart the pardep function. The following line in pardep seems to cause the issue when pred.var contains more than one feature:

temp[, pred.var] <- pred.grid[i, pred.var]

It seems that pred.grid is converted with data.matrix, and subsetting it to a single row collapses it into a plain vector, which then populates temp column by column rather than row by row. Using the example above, my dataset is meant to look like

          Mass    Age
Row 1:      20     30
Row 2:      20     30

when running the first grid point (i.e., i = 1 in the foreach loop).
But what I'm seeing when stepping through is:

          Mass    Age
Row 1:      20     20
Row 2:      30     30
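
The pattern can be reproduced outside of pdp with base R alone. The sketch below is not pdp's actual code: temp and grid_row are hypothetical stand-ins, and it assumes temp is a plain numeric matrix and that the selected grid row arrives as a plain vector. Under those assumptions, the assignment recycles the vector down the columns, which matches what I see when stepping through.

```r
# Hypothetical stand-in for `temp`: two training rows, two features.
temp <- matrix(0, nrow = 2, ncol = 2,
               dimnames = list(NULL, c("mass", "age")))

# Hypothetical stand-in for one grid point after it has been
# reduced to a plain vector.
grid_row <- c(mass = 20, age = 30)

# Assigning a vector into a matrix block fills it in column-major
# order, so 20 and 30 are recycled down each column.
filled_by_column <- temp
filled_by_column[, c("mass", "age")] <- grid_row

# One possible alternative (an assumption, not a tested fix for pdp):
# expand the grid point into a matrix with one copy per training row.
filled_by_row <- temp
filled_by_row[, c("mass", "age")] <- matrix(grid_row,
                                            nrow = nrow(temp),
                                            ncol = length(grid_row),
                                            byrow = TRUE)

print(filled_by_column)
print(filled_by_row)
```

With a single training row the two assignments give the same result, which would explain why person_1 and person_2 look fine on their own.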

Is this a known issue or am I doing something wrong when feeding in the parameters to partial?

Hi @rahualram, thanks for the note. I'll look into this shortly!