pred_grid() breaks when using output from caret::train() unless all explanatory vars are cast as factors
alexkrolak opened this issue · 4 comments
After fighting with partial() for the greater part of today, I've come to realize that pred_grid() - called by partial() - expects all of the explanatory variables to be cast as factor()s.
I created a binary classifier via caret's train() function, and tested partial() on all of the example cases - which worked fine. However, the only way I could force it to work with my actual data was to pre-cast all explanatory variables to factor()s before running caret::train(), then put the result into partial(). Even after trying to utilize the "cats" argument, I kept getting the same error messages (below). It seems like any data.frame/data.table won't utilize the "cats" argument, and I don't know if this is intentional. Perhaps it ought to be usable for these classes as well? I'm not sure if you're expecting all qualitative predictor variables to be factors already or not either. Ideally that would not be the case, and the cats argument would be able to be utilized for this sort of situation.
From partial's documentation:
Character string indicating which columns of train should be treated as categorical variables. Only used when train inherits from class "matrix" or "dgCMatrix".
wc_med_fit_new_test$trainingData %>% class
[1] "data.table" "data.frame"
wc_med_fit_new_test %>% class
[1] "train" "train.formula"
partial(wc_med_fit_new_test, pred.var = "age_cut", cats="factor")
Error in seq.default(from = min(y, na.rm = TRUE), to = max(y, na.rm = TRUE), :
'from' must be a finite number
partial(wc_med_fit_new_test, pred.var = "age_cut", cats="character")
Error in seq.default(from = min(y, na.rm = TRUE), to = max(y, na.rm = TRUE), :
'from' must be a finite number
@alexkrolak Thank you for reporting the issue! It is difficult to say what the specific cause is without a reproducible example; would you mind posting one? In any case, it looks like you're using the cats argument incorrectly (perhaps the documentation could be improved here). The cats argument specifies which of the columns listed in pred.var should be treated as categorical if they are not factors. For instance:
df <- data.frame(x1 = 1:3, x2 = c("a", "b", "a"), x3 = 5:7, stringsAsFactors = FALSE)
# This fails because "x2" is character, not a factor
pdp:::pred_grid(df, pred.var = "x2")
# Error in seq.default(from = min(y, na.rm = TRUE), to = max(y, na.rm = TRUE), :
# 'from' must be a finite number
# In addition: Warning message:
# In seq.default(from = min(y, na.rm = TRUE), to = max(y, na.rm = TRUE), :
# This works
pdp:::pred_grid(df, pred.var = "x2", cats = "x2")
# x2
# 1 a
# 2 b
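The error message itself hints at the cause: for a column not flagged as categorical, the grid appears to be built by calling seq() over the column's range, which cannot work for character data. Here is a minimal base R sketch of that failure, independent of pdp. The variable names are illustrative and this is not pdp's actual code.

```r
# Illustrative character column, like "x2" in the example above
y <- c("a", "b", "a")

# min() and max() accept character vectors (alphabetical order) ...
rng <- c(min(y, na.rm = TRUE), max(y, na.rm = TRUE))

# ... but seq() needs numeric endpoints, so the call errors.
# tryCatch() captures the error message instead of stopping the script.
msg <- suppressWarnings(tryCatch(
  seq(from = rng[1], to = rng[2], length.out = 3),
  error = function(e) conditionMessage(e)
))
print(msg)

# A factor, by contrast, exposes its distinct values via levels()
lev <- levels(factor(y))
print(lev)
```

This is why a factor column, or a column named in cats, avoids the error: its grid can be taken from the distinct values rather than from a numeric range.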
So I suspect the following should work for you:
partial(wc_med_fit_new_test, pred.var = "age_cut", cats="age_cut")
If this is the case, what is the class of column "age_cut"? If it is character, this should be an easy fix to avoid having to specify the cats argument.
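If the column does turn out to be character, one workaround in the meantime is the one you already found: convert character columns to factors before calling caret::train(). A minimal base R sketch, assuming a plain data.frame (a data.table indexes differently, so convert with as.data.frame() first or use data.table's own syntax). The data frame here is the toy one from above, not your data.

```r
# Toy data with one character column
df <- data.frame(x1 = 1:3, x2 = c("a", "b", "a"), x3 = 5:7,
                 stringsAsFactors = FALSE)

# Find the character columns ...
is_chr <- vapply(df, is.character, logical(1))

# ... and convert only those to factors, leaving numeric columns alone
df[is_chr] <- lapply(df[is_chr], factor)

# Check the resulting column classes
col_classes <- vapply(df, function(col) class(col)[1], character(1))
print(col_classes)
```

Converting only the character columns keeps numeric predictors numeric, so their grids are still built over a numeric range.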
Also, it looks like the cats argument in partial() never got passed to pred_grid(). I just pushed a fix to the dev version on GitHub. Let me know if it still does not work for you.
Thanks! The dev version of pdp that passes cats into pred_grid() has helped.
Also, changing my "cats" argument to the actual column names in the function call helped.
I'm not sure, but it's possible my factor variables are being recoded within the partial() function call somewhere, and I'm getting really extreme partial effects for some of my variables. Some much larger than I'd expect based on the logistic's coefficients/their odds ratios.
partial() doesn’t recode anything, but it’s difficult to say for certain without an example. If you can post one in a new issue, please do and I’ll be sure to look into it ASAP.