Benchmarking different approaches for categorical encoding
numpy==1.15.1
pandas==0.23.4
sklearn==0.20.3
category_encoders==2.0.0
lightgbm==2.2.3
To benchmark encoders for your dataset:
- Install libraries in requirements
- Process the dataset as in prepare_datasets.ipynb
- Add name of the dataset in dataset_list in run_experiment.py
- python run_experiment.py
- Run show_results.ipynb
All datasets except poverty_A, poverty_B and poverty_C come from different domains; they differ in the number of observations and in the number of categorical and numerical features. The objective for every dataset is binary classification. Preprocessing was simple: I removed all time-based columns, so the remaining columns were either categorical or numerical. Details of the experiments can be found in my blog post: Benchmarking Categorical Encoders.
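The preprocessing described above can be sketched roughly as follows. This is a minimal illustration with pandas, not the code from prepare_datasets.ipynb: the helper name `split_feature_types`, the toy columns, and the rule "non-numeric means categorical" are all assumptions made for the example.

```python
import pandas as pd


def split_feature_types(df, target, time_columns):
    """Drop time-based columns and the target, then split the remaining
    columns into categorical (non-numeric) and numerical lists.

    Hypothetical helper: the dtype-based rule is an assumption.
    """
    features = df.drop(columns=list(time_columns) + [target])
    categorical = [
        col for col in features.columns
        if not pd.api.types.is_numeric_dtype(features[col])
    ]
    numerical = [col for col in features.columns if col not in categorical]
    return features, categorical, numerical


# Toy data standing in for a real dataset (values are made up).
df = pd.DataFrame({
    "signup_date": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "plan": ["basic", "pro", "basic"],
    "monthly_charge": [20.0, 55.5, 21.0],
    "churn": [0, 1, 0],
})

features, cat_cols, num_cols = split_feature_types(
    df, target="churn", time_columns=["signup_date"]
)
print(cat_cols, num_cols)
```

Integer-coded categories would be treated as numerical by this rule, so a real dataset may need an explicit list of categorical columns instead.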
Table 1.1 Datasets used
Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
---|---|---|---|---|---|---|
Telecom | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
Adult | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict whether a person's income exceeds 50k |
Employee | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs, given his/her job role |
Credit | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
Mortgages | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict if house mortgage is funded |
Promotion | 54.8k | 32.8k | 21.9k | 13 | 5 | Predict if an employee will get a promotion |
Kick | 72.9k | 43.7k | 29.1k | 32 | 19 | Predict if a car purchased at auction is good/bad buy |
Kdd_upselling | 50k | 30k | 20k | 230 | 40 | Predict up-selling for a customer |
Taxi | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
Poverty_A | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether a given household in a given country is poor |
Poverty_B | 20.2k | 12.1k | 8.1k | 224 | 191 | Predict whether a given household in a given country is poor |
Poverty_C | 29.9k | 17.9k | 11.9k | 41 | 35 | Predict whether a given household in a given country is poor |
The ROC AUC scores for each dataset are presented in the tables below. Note: some experiments required too much memory to run, so some values are missing.
Table 1.2 ROC AUC scores for None Validation
Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BackwardDifferenceEncoder | 0.6454 | 0.8555 | 0.5006 | 0.7442 | 0.5997 | 0.6482 | | | | 0.5149 | 0.5484 | 0.4945 |
CatBoostEncoder | 0.7666 | 0.868 | 0.5004 | 0.7478 | 0.6279 | 0.7811 | 0.6583 | 0.8549 | 0.5477 | 0.5179 | 0.5638 | 0.5427 |
FrequencyEncoder | 0.8405 | 0.9291 | 0.807 | 0.7593 | 0.6949 | 0.9052 | 0.7907 | 0.8643 | 0.5656 | 0.7276 | 0.6164 | 0.7177 |
HelmertEncoder | 0.8404 | 0.9297 | 0.83 | 0.7601 | 0.7001 | 0.9079 | | | | 0.7325 | 0.6343 | 0.7168 |
JamesSteinEncoder | 0.7195 | 0.8688 | 0.5003 | 0.7485 | 0.6049 | 0.7984 | 0.6592 | 0.8516 | 0.5432 | 0.4918 | 0.5304 | 0.4836 |
LeaveOneOutEncoder | 0.5 | 0.5214 | 0.6233 | 0.4957 | 0.5 | 0.5457 | 0.5027 | 0.5 | 0.5 | 0.5006 | 0.5002 | 0.4527 |
MEstimateEncoder | 0.6944 | 0.8617 | 0.4998 | 0.7368 | 0.6086 | 0.8156 | 0.653 | 0.8448 | 0.5091 | 0.5254 | 0.434 | 0.4528 |
OrdinalEncoder | 0.7409 | 0.8616 | 0.501 | 0.7445 | 0.6008 | 0.7124 | 0.6531 | 0.8448 | 0.5498 | 0.473 | 0.4683 | 0.5611 |
SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
TargetEncoder | 0.7195 | 0.8696 | 0.5003 | 0.7483 | 0.6064 | 0.7971 | 0.6594 | 0.8483 | 0.5428 | 0.4955 | 0.5401 | 0.4751 |
WOEEncoder | 0.7056 | 0.8645 | 0.5012 | 0.7439 | 0.615 | 0.7345 | 0.6398 | 0.844 | 0.5485 | 0.478 | 0.5356 | 0.4671 |
Table 1.3 ROC AUC scores for Single Validation
Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BackwardDifferenceEncoder | 0.8382 | 0.9293 | 0.7569 | 0.7595 | 0.6894 | 0.9064 | | | | 0.7323 | 0.6151 | 0.7108 |
CatBoostEncoder | 0.8392 | 0.9292 | 0.8498 | 0.7594 | 0.6951 | 0.8918 | 0.7901 | 0.8654 | 0.5844 | 0.7429 | 0.6902 | 0.7333 |
FrequencyEncoder | 0.8392 | 0.9293 | 0.8138 | 0.7592 | 0.6937 | 0.9055 | 0.7902 | 0.8634 | 0.582 | 0.7302 | 0.6128 | 0.7195 |
HelmertEncoder | 0.8404 | 0.9297 | 0.8344 | 0.7597 | 0.7027 | 0.9083 | | | | 0.7297 | 0.6374 | 0.7196 |
JamesSteinEncoder | 0.8388 | 0.9292 | 0.7817 | 0.7597 | 0.667 | 0.9053 | 0.5835 | 0.726 | 0.5898 | 0.7303 | 0.6764 | 0.7217 |
LeaveOneOutEncoder | 0.5 | 0.5182 | 0.6121 | 0.4997 | 0.5 | 0.5403 | 0.4682 | 0.5 | 0.5 | 0.5103 | 0.5 | 0.4959 |
MEstimateEncoder | 0.8394 | 0.929 | 0.7353 | 0.7593 | 0.6957 | 0.9054 | 0.5877 | 0.5953 | 0.5946 | 0.7302 | 0.6493 | 0.7076 |
OrdinalEncoder | 0.8404 | 0.9299 | 0.8274 | 0.7585 | 0.6917 | 0.9078 | 0.7809 | 0.8465 | 0.6034 | 0.7337 | 0.6635 | 0.742 |
SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | | | | 0.7355 | 0.6206 | 0.7372 |
TargetEncoder | 0.8388 | 0.9293 | 0.815 | 0.7599 | 0.6702 | 0.9057 | 0.7042 | 0.713 | 0.5894 | 0.7292 | 0.6742 | 0.7207 |
WOEEncoder | 0.8393 | 0.9294 | 0.8325 | 0.7599 | 0.6801 | 0.9056 | 0.7172 | 0.8391 | 0.5903 | 0.7279 | 0.6737 | 0.7224 |
Table 1.4 ROC AUC scores for Double Validation
Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CatBoostEncoder | 0.8394 | 0.9293 | 0.8529 | 0.7592 | 0.6967 | 0.9056 | 0.7899 | 0.8633 | 0.6031 | 0.7418 | 0.6902 | 0.7343 |
FrequencyEncoder | 0.8371 | 0.9221 | 0.5563 | 0.755 | 0.6582 | 0.8749 | 0.7655 | 0.8551 | 0.5657 | 0.6873 | 0.6037 | 0.6961 |
JamesSteinEncoder | 0.8398 | 0.9296 | 0.8489 | 0.7598 | 0.6981 | 0.905 | 0.7901 | 0.8628 | 0.6033 | 0.7412 | 0.6895 | 0.7366 |
LeaveOneOutEncoder | 0.8393 | 0.9295 | 0.8496 | 0.7595 | 0.6963 | 0.9055 | 0.7902 | 0.8635 | 0.602 | 0.7416 | 0.6931 | 0.7345 |
MEstimateEncoder | 0.8405 | 0.9292 | 0.8125 | 0.7597 | 0.6939 | 0.9063 | 0.7881 | 0.863 | 0.5984 | 0.7375 | 0.6801 | 0.7204 |
TargetEncoder | 0.8393 | 0.9294 | 0.8537 | 0.7596 | 0.6954 | 0.9057 | 0.7909 | 0.8643 | 0.6025 | 0.7415 | 0.6903 | 0.7352 |
WOEEncoder | 0.8401 | 0.9294 | 0.824 | 0.7599 | 0.6977 | 0.9041 | 0.7905 | 0.8631 | 0.6011 | 0.7407 | 0.6911 | 0.7345 |
To determine the best encoder, I min-max scaled the ROC AUC scores within each dataset and then averaged the scaled scores for each encoder. The result is an average performance score for each encoder (higher is better). The encoder performance scores for each type of validation are shown in tables 2.1–2.3.
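A minimal sketch of this scoring with pandas, using a handful of ROC AUC values copied from Tables 1.2 and 1.3. It assumes the minimum and maximum for each dataset are taken over all encoders and validation types together, which appears consistent with the published scores but is an inference, not something taken from the benchmark code. Because only a subset of rows and datasets is used, the numbers it prints will not match Tables 2.1–2.3.

```python
import pandas as pd

# (validation, encoder, telecom, adult): a small subset of Tables 1.2 and 1.3.
rows = [
    ("None", "CatBoostEncoder", 0.7666, 0.868),
    ("None", "FrequencyEncoder", 0.8405, 0.9291),
    ("None", "LeaveOneOutEncoder", 0.5, 0.5214),
    ("Single", "CatBoostEncoder", 0.8392, 0.9292),
    ("Single", "FrequencyEncoder", 0.8392, 0.9293),
    ("Single", "LeaveOneOutEncoder", 0.5, 0.5182),
]
scores = pd.DataFrame(
    rows, columns=["validation", "encoder", "telecom", "adult"]
).set_index(["validation", "encoder"])

# Min-max scale each dataset (column): worst run -> 0, best run -> 1.
scaled = (scores - scores.min()) / (scores.max() - scores.min())

# Average the scaled scores over datasets for each (validation, encoder) pair.
# mean() skips missing values, so runs that failed would not drag a score down.
performance = scaled.mean(axis=1)

# One column per validation type, one row per encoder.
table = performance.unstack("validation")
print(table.round(4))
```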
To determine the best validation strategy, I compared the top score on each dataset across the validation types. The improvements in score (in the top score per dataset and in the average score per encoder) are shown in tables 2.4 and 2.5 below.
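A rough sketch of how such improvements could be computed, using a few ROC AUC values copied from Tables 1.2 and 1.3. The two definitions are assumptions inferred from the published numbers rather than taken from the benchmark code: the per-dataset figure is treated as the change in the best ROC AUC in percentage points, and the per-encoder figure as the mean relative change in ROC AUC across datasets. The helper names are hypothetical.

```python
# validation type -> encoder -> dataset -> ROC AUC (subset of Tables 1.2 and 1.3).
roc_auc = {
    "None": {
        "CatBoostEncoder": {"kick": 0.6583, "taxi": 0.5477},
        "FrequencyEncoder": {"kick": 0.7907, "taxi": 0.5656},
        "OrdinalEncoder": {"kick": 0.6531, "taxi": 0.5498},
    },
    "Single": {
        "CatBoostEncoder": {"kick": 0.7901, "taxi": 0.5844},
        "FrequencyEncoder": {"kick": 0.7902, "taxi": 0.582},
        "OrdinalEncoder": {"kick": 0.7809, "taxi": 0.6034},
    },
}


def top_score(validation, dataset):
    """Best ROC AUC reached on a dataset by any encoder under one validation type."""
    return max(by_dataset[dataset] for by_dataset in roc_auc[validation].values())


def top_score_change(dataset, before, after):
    """Change in the best ROC AUC for a dataset, in percentage points (assumed definition)."""
    return 100 * (top_score(after, dataset) - top_score(before, dataset))


def mean_relative_change(encoder, before, after):
    """Mean per-dataset relative change in ROC AUC for an encoder, in percent (assumed definition)."""
    old, new = roc_auc[before][encoder], roc_auc[after][encoder]
    changes = [100 * (new[d] - old[d]) / old[d] for d in old if d in new]
    return sum(changes) / len(changes)


taxi_change = top_score_change("taxi", "None", "Single")
kick_change = top_score_change("kick", "None", "Single")
catboost_change = mean_relative_change("CatBoostEncoder", "None", "Single")
print(round(taxi_change, 2), round(kick_change, 2), round(catboost_change, 2))
```

For taxi and kick the top-score changes agree with the None -> Single column of Table 2.4. The CatBoostEncoder figure differs from Table 2.5 because only two of the twelve datasets are included here.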
Table 2.1 Encoders performance scores - None Validation
Encoder | None Validation |
---|---|
HelmertEncoder | 0.9517 |
SumEncoder | 0.9434 |
FrequencyEncoder | 0.9176 |
CatBoostEncoder | 0.5728 |
TargetEncoder | 0.5174 |
JamesSteinEncoder | 0.5162 |
OrdinalEncoder | 0.4964 |
WOEEncoder | 0.4905 |
MEstimateEncoder | 0.4501 |
BackwardDifferenceEncoder | 0.4128 |
LeaveOneOutEncoder | 0.0697 |
Table 2.2 Encoders performance scores - Single Validation
Encoder | Single Validation |
---|---|
CatBoostEncoder | 0.9726 |
OrdinalEncoder | 0.9694 |
HelmertEncoder | 0.9558 |
SumEncoder | 0.9434 |
WOEEncoder | 0.9326 |
FrequencyEncoder | 0.9315 |
BackwardDifferenceEncoder | 0.9108 |
TargetEncoder | 0.8915 |
JamesSteinEncoder | 0.8555 |
MEstimateEncoder | 0.8189 |
LeaveOneOutEncoder | 0.0729 |
Table 2.3 Encoders performance scores - Double Validation
Encoder | Double Validation |
---|---|
JamesSteinEncoder | 0.9918 |
CatBoostEncoder | 0.9917 |
TargetEncoder | 0.9916 |
LeaveOneOutEncoder | 0.9909 |
WOEEncoder | 0.9838 |
MEstimateEncoder | 0.9686 |
FrequencyEncoder | 0.8018 |
Table 2.4 Top score improvement (percent)
Dataset | None -> Single | Single -> Double |
---|---|---|
telecom | 0.00 | 0.01 |
adult | 0.02 | -0.03 |
employee | 1.98 | 0.39 |
credit | -0.01 | -0.00 |
mortgages | 0.26 | -0.47 |
promotion | 0.04 | -0.20 |
kick | -0.05 | 0.06 |
kdd_upselling | 0.10 | -0.11 |
taxi | 3.78 | -0.01 |
poverty_A | 0.74 | -0.11 |
poverty_B | 5.59 | 0.29 |
poverty_C | 0.48 | -0.54 |
Table 2.5 Encoders performance scores improvement (percent)
Encoder | None -> Single | Single -> Double |
---|---|---|
BackwardDifferenceEncoder | 27.20 | |
CatBoostEncoder | 20.10 | 0.40 |
FrequencyEncoder | 0.30 | -4.90 |
HelmertEncoder | 0.20 | |
JamesSteinEncoder | 17.70 | 6.30 |
LeaveOneOutEncoder | 0.20 | 53.20 |
MEstimateEncoder | 18.90 | 8.10 |
OrdinalEncoder | 24.10 | |
SumEncoder | 0.00 | |
TargetEncoder | 19.60 | 4.20 |
WOEEncoder | 23.40 | 1.90 |