How to get Partial Dependence of target encoding based categorical feature
kaoribundo opened this issue · 3 comments
issue : The 'pdp' package is very useful, but I have a problem in one case.
To fit an xgboost model, I converted categorical features to numeric features using target encoding. I then used the 'partial' function with the 'cats' argument for such a feature. I expected yhat to be computed at each target-encoded value, but the feature values returned by partial were different. (The original values and the partial values did not match.)
Here is example code reproducing the problem.
Please tell me how to solve this problem.
## Load Required packages
library(dplyr)
library(xgboost)
library(pdp)
## data (example)
## real data has more features
> head(data_example)
objective numeric_feature one_hot_encoding_feature target_encoding_feature
1 1 392 0 6.077463e-05
2 1 765 0 2.891865e-03
3 1 643 0 3.254317e-03
4 0 330 0 5.517329e-05
5 0 194 0 1.075839e-05
6 0 194 0 1.372488e-05
## Modeling
### Convert the data frame to an xgb.DMatrix (a label is required for training)
train_data <- xgb.DMatrix(
  as.matrix(dplyr::select(data_example, -objective)),
  label = data_example$objective
)
### Fit the xgboost model (nrounds is required; 50 is an arbitrary choice here)
xgb_model <- xgboost(
  data    = train_data,
  params  = list("objective" = "binary:logistic"),
  nrounds = 50
)
## Get Partial Dependence of each feature
## numeric_feature
## it worked !!
p_numeric <- partial(xgb_model
  ,pred.var = "numeric_feature"
  ,train = as.matrix(dplyr::select(data_example, -objective))
  ,type = "regression"
)
>head(p_numeric)
numeric_feature yhat
1 0.00 0.03914417
2 30.38 0.03359368
3 60.76 0.03358342
4 91.14 0.03340583
5 121.52 0.03329919
6 151.90 0.03330897
## one_hot_encoding_feature
## it worked !!
p_onehot <- partial(xgb_model
  ,pred.var = "one_hot_encoding_feature"
  ,train = as.matrix(dplyr::select(data_example, -objective))
  ,type = "regression"
  ,cats = "one_hot_encoding_feature"
)
>p_onehot
one_hot_encoding_feature yhat
1 0 0.03315646
2 1 0.03301551
## target_encoding_feature
## faced to the problem !
p_target <- partial(xgb_model
  ,pred.var = "target_encoding_feature"
  ,train = as.matrix(dplyr::select(data_example, -objective))
  ,type = "regression"
  ,cats = "target_encoding_feature"
)
> head(p_target)
target_encoding_feature yhat
1 1.075839e-05 0.02349093
2 9.605801e-04 0.05872103
3 1.910402e-03 0.05886231
4 2.860224e-03 0.06028017
5 3.810045e-03 0.06624583
6 4.759867e-03 0.06624583
## compare the original target-encoded values with the partial output
## I would like to get the yhat corresponding to each original value
>sort(unique(data_example$target_encoding_feature))
[1] 1.075839e-05 1.294359e-05 1.360556e-05 1.372488e-05 1.468446e-05 1.509768e-05 1.766756e-05 5.517329e-05
[9] 6.077463e-05 6.478200e-05 6.573776e-05 7.262987e-05 7.770370e-05 7.959780e-05 8.514537e-05 8.717332e-05
[17] 1.070893e-04 1.175257e-04 1.351717e-04 1.626339e-04 1.948620e-04 1.994245e-04 2.062776e-04 2.140787e-04
[25] 2.141661e-04 2.166361e-04 2.241656e-04 2.481869e-04 3.676495e-04 3.923796e-04 4.283972e-04 4.383589e-04
[33] 4.499127e-04 4.738567e-04 5.141846e-04 7.232350e-04 7.588255e-04 7.852622e-04 8.776138e-04 9.964129e-04
[41] 1.066354e-03 1.074656e-03 2.891865e-03 2.905396e-03 3.172273e-03 3.237116e-03 3.254317e-03 3.308820e-03
[49] 3.401120e-03 9.411765e-03 9.624639e-03 1.082056e-02 1.123596e-02 1.181525e-02 1.377727e-02 1.910828e-02
[57] 2.047981e-02 2.286483e-02 2.544910e-02 2.588556e-02 2.633559e-02 2.978723e-02 3.880901e-02 4.027976e-02
[65] 4.114286e-02 4.155194e-02 4.496066e-02 4.574758e-02 4.750185e-02
>p_target$target_encoding_feature
[1] 1.075839e-05 9.605801e-04 1.910402e-03 2.860224e-03 3.810045e-03 4.759867e-03 5.709689e-03 6.659511e-03
[9] 7.609332e-03 8.559154e-03 9.508976e-03 1.045880e-02 1.140862e-02 1.235844e-02 1.330826e-02 1.425808e-02
[17] 1.520791e-02 1.615773e-02 1.710755e-02 1.805737e-02 1.900719e-02 1.995702e-02 2.090684e-02 2.185666e-02
[25] 2.280648e-02 2.375630e-02 2.470612e-02 2.565595e-02 2.660577e-02 2.755559e-02 2.850541e-02 2.945523e-02
[33] 3.040505e-02 3.135488e-02 3.230470e-02 3.325452e-02 3.420434e-02 3.515416e-02 3.610398e-02 3.705381e-02
[41] 3.800363e-02 3.895345e-02 3.990327e-02 4.085309e-02 4.180292e-02 4.275274e-02 4.370256e-02 4.465238e-02
[49] 4.560220e-02 4.655202e-02 4.750185e-02
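The grid printed above does not contain the original encoded values. A quick base-R check (the `lo`/`hi` constants are copied from the output above) suggests it is simply 51 equally spaced points between the smallest and largest encoded values:

```r
# Reconstruct an equally spaced 51-point grid between the min and max
# encoded values (taken from the printed output) and compare it, to printed
# precision, with the grid returned by partial().
lo <- 1.075839e-05
hi <- 4.750185e-02
g  <- seq(lo, hi, length.out = 51)
signif(head(g), 7)  # closely matches the first grid values printed above
```

This matches the printed grid to its displayed precision, which is consistent with an evenly spaced default grid rather than the unique encoded values.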
Thank you for reading this issue.
Hi @kaoribundo. Perhaps I'm not fully understanding the issue. Why would you expect the partial dependence (i.e., yhat) to match the encoded feature values?
Hello @bgreenwell. Thank you for your reply, and sorry for my poor explanation.
For example, suppose I have an age_range feature and would like to understand the partial dependence of each category (10s, 20s, and so on).
I converted this categorical feature to a numeric one using target encoding (the occurrence rate of the objective within each category, e.g. 10s → 0.25, 20s → 0.5), and fit the xgboost model.
To interpret the behaviour of the model for each category (10s, 20s, ...), I think we need the yhat at 0.25 and 0.5.
I have little knowledge of interpretable machine learning, so maybe I am asking for something unreasonable.
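To make the encoding concrete, here is a minimal base-R sketch of target (mean) encoding on hypothetical `age_range` data (not the dataset from this issue): each category is replaced by the mean of the objective within that category.

```r
# Toy illustration of target (mean) encoding -- hypothetical data.
d <- data.frame(
  age_range = c("10s", "10s", "20s", "20s", "20s", "30s"),
  objective = c(0, 1, 1, 0, 1, 0)
)
enc <- tapply(d$objective, d$age_range, mean)   # event rate per category
enc                                             # e.g. "10s" maps to 0.5 here
d$age_range_te <- as.numeric(enc[d$age_range])  # encoded numeric column
```

With this encoding, every row of a given category carries the same numeric value, so a per-category partial dependence means evaluating yhat at exactly those values.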
Hi @bgreenwell
I solved my problem by using the 'pred.grid' argument, which gives yhat values that match the target-encoded categorical feature.
I read your code and understood that when the 'pred.grid' argument is missing, the 'pred.val' function fills in the pred.grid values, and that when the 'cats' argument is used, the unique values of the feature are used for pred.grid.
Lines 25 to 27 in a7f755c
This seems to work for numeric features as well, but I could not get the results I expected. Anyway, I got the expected results by using the 'pred.grid' argument!
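What supplying 'pred.grid' achieves can be sketched in base R alone. This is a toy example with a hypothetical encoded feature `x_enc` and an lm() model standing in for the real dataset and the xgboost model; the idea of evaluating partial dependence only at the feature's original encoded values carries over unchanged.

```r
# Manual partial dependence at user-chosen grid points (the role pred.grid
# plays in pdp::partial). Toy data; lm() stands in for xgboost.
set.seed(1)
df <- data.frame(
  x_enc = rep(c(0.25, 0.50, 0.75), each = 20),  # "target-encoded" feature
  z     = rnorm(60)                             # another feature
)
df$y <- df$x_enc + 0.5 * df$z + rnorm(60, sd = 0.1)
fit  <- lm(y ~ x_enc + z, data = df)

grid <- sort(unique(df$x_enc))          # exactly the encoded category values
pd   <- vapply(grid, function(v) {
  tmp <- df
  tmp$x_enc <- v                        # hold the feature fixed at one value
  mean(predict(fit, newdata = tmp))     # average prediction over the data
}, numeric(1))
data.frame(x_enc = grid, yhat = pd)     # one yhat per encoded category
```

Passing the unique encoded values via 'pred.grid' makes partial() do the same thing: one yhat per original category, instead of an automatically generated grid.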
Thank you for your time, and I will close this issue.