CategoricalEncodingBenchmark

Benchmarking different approaches for categorical encoding

Reproducibility of results

Requirements

numpy==1.15.1
pandas==0.23.4
scikit-learn==0.20.3
category_encoders==2.0.0
lightgbm==2.2.3

Benchmark the dataset

To benchmark encoders on your dataset:

Install the libraries listed in Requirements
Process the dataset as in prepare_datasets.ipynb
Add the name of the dataset to dataset_list in run_experiment.py
Run python run_experiment.py
Run show_results.ipynb

Used datasets and raw scores

All datasets except poverty_A, poverty_B, and poverty_C come from different domains; they differ in the number of observations and in the number of categorical and numerical features. The objective for every dataset is binary classification. Preprocessing was simple: I removed all time-based columns, so the remaining columns are either categorical or numerical. Details of the experiments can be found in my blog post: Benchmarking Categorical Encoders.
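The preprocessing step described above can be sketched roughly as follows. This is only an illustration, not the code from prepare_datasets.ipynb: the function name split_columns, the toy column names, and the rule used to detect time-based columns (a datetime dtype) are all assumptions.

```python
import pandas as pd


def split_columns(df, target):
    """Drop time-based columns, then split the rest into categorical and numerical.

    Assumptions: time-based columns already have a datetime dtype, and every
    remaining non-numeric column is treated as categorical.
    """
    features = df.drop(columns=[target])
    time_cols = features.select_dtypes(include=["datetime"]).columns.tolist()
    features = features.drop(columns=time_cols)
    num_cols = features.select_dtypes(include=["number"]).columns.tolist()
    cat_cols = [c for c in features.columns if c not in num_cols]
    return features, cat_cols, num_cols


# Toy data with hypothetical column names
toy = pd.DataFrame({
    "signup_date": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
    "plan": ["basic", "pro", "basic"],
    "monthly_charge": [20.0, 55.5, 21.0],
    "churn": [0, 1, 0],
})
features, cat_cols, num_cols = split_columns(toy, target="churn")
print(cat_cols, num_cols)
```

Real datasets often store dates as strings, in which case the time-based columns would have to be listed by hand or parsed first.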

Table 1.1 Used datasets

Name	Total points	Train points	Test points	Number of features	Number of categorical features	Short description
Telecom	7.0k	4.2k	2.8k	20	16	Churn prediction for telecom data
Adult	48.8k	29.3k	19.5k	15	8	Predict whether a person's income exceeds 50k
Employee	32.7k	19.6k	13.1k	10	9	Predict an employee's access needs, given his/her job role
Credit	307.5k	184.5k	123k	121	18	Loan repayment
Mortgages	45.6k	27.4k	18.2k	20	9	Predict if a house mortgage is funded
Promotion	54.8k	32.8k	21.9k	13	5	Predict if an employee will get a promotion
Kick	72.9k	43.7k	29.1k	32	19	Predict if a car purchased at auction is a good or bad buy
Kdd_upselling	50k	30k	20k	230	40	Predict up-selling for a customer
Taxi	892.5k	535.5k	357k	8	5	Predict the probability of an offer being accepted by a certain driver
Poverty_A	37.6k	22.5k	15.0k	41	38	Predict whether a given household in a given country is poor
Poverty_B	20.2k	12.1k	8.1k	224	191	Predict whether a given household in a given country is poor
Poverty_C	29.9k	17.9k	11.9k	41	35	Predict whether a given household in a given country is poor

The ROC AUC scores for each dataset are presented in the tables below. Note: some experiments required too much memory to run, so the corresponding values are missing.

Table 1.2 ROC AUC scores for None Validation

	telecom	adult	employee	credit	mortgages	promotion	kick	kdd_upselling	taxi	poverty_A	poverty_B	poverty_C
BackwardDifferenceEncoder	0.6454	0.8555	0.5006	0.7442	0.5997	0.6482				0.5149	0.5484	0.4945
CatBoostEncoder	0.7666	0.868	0.5004	0.7478	0.6279	0.7811	0.6583	0.8549	0.5477	0.5179	0.5638	0.5427
FrequencyEncoder	0.8405	0.9291	0.807	0.7593	0.6949	0.9052	0.7907	0.8643	0.5656	0.7276	0.6164	0.7177
HelmertEncoder	0.8404	0.9297	0.83	0.7601	0.7001	0.9079				0.7325	0.6343	0.7168
JamesSteinEncoder	0.7195	0.8688	0.5003	0.7485	0.6049	0.7984	0.6592	0.8516	0.5432	0.4918	0.5304	0.4836
LeaveOneOutEncoder	0.5	0.5214	0.6233	0.4957	0.5	0.5457	0.5027	0.5	0.5	0.5006	0.5002	0.4527
MEstimateEncoder	0.6944	0.8617	0.4998	0.7368	0.6086	0.8156	0.653	0.8448	0.5091	0.5254	0.434	0.4528
OrdinalEncoder	0.7409	0.8616	0.501	0.7445	0.6008	0.7124	0.6531	0.8448	0.5498	0.473	0.4683	0.5611
SumEncoder	0.8404	0.929	0.8053	0.7593	0.6944	0.9073				0.7355	0.6206	0.7372
TargetEncoder	0.7195	0.8696	0.5003	0.7483	0.6064	0.7971	0.6594	0.8483	0.5428	0.4955	0.5401	0.4751
WOEEncoder	0.7056	0.8645	0.5012	0.7439	0.615	0.7345	0.6398	0.844	0.5485	0.478	0.5356	0.4671

Table 1.3 ROC AUC scores for Single Validation

	telecom	adult	employee	credit	mortgages	promotion	kick	kdd_upselling	taxi	poverty_A	poverty_B	poverty_C
BackwardDifferenceEncoder	0.8382	0.9293	0.7569	0.7595	0.6894	0.9064				0.7323	0.6151	0.7108
CatBoostEncoder	0.8392	0.9292	0.8498	0.7594	0.6951	0.8918	0.7901	0.8654	0.5844	0.7429	0.6902	0.7333
FrequencyEncoder	0.8392	0.9293	0.8138	0.7592	0.6937	0.9055	0.7902	0.8634	0.582	0.7302	0.6128	0.7195
HelmertEncoder	0.8404	0.9297	0.8344	0.7597	0.7027	0.9083				0.7297	0.6374	0.7196
JamesSteinEncoder	0.8388	0.9292	0.7817	0.7597	0.667	0.9053	0.5835	0.726	0.5898	0.7303	0.6764	0.7217
LeaveOneOutEncoder	0.5	0.5182	0.6121	0.4997	0.5	0.5403	0.4682	0.5	0.5	0.5103	0.5	0.4959
MEstimateEncoder	0.8394	0.929	0.7353	0.7593	0.6957	0.9054	0.5877	0.5953	0.5946	0.7302	0.6493	0.7076
OrdinalEncoder	0.8404	0.9299	0.8274	0.7585	0.6917	0.9078	0.7809	0.8465	0.6034	0.7337	0.6635	0.742
SumEncoder	0.8404	0.929	0.8053	0.7593	0.6944	0.9073				0.7355	0.6206	0.7372
TargetEncoder	0.8388	0.9293	0.815	0.7599	0.6702	0.9057	0.7042	0.713	0.5894	0.7292	0.6742	0.7207
WOEEncoder	0.8393	0.9294	0.8325	0.7599	0.6801	0.9056	0.7172	0.8391	0.5903	0.7279	0.6737	0.7224

Table 1.4 ROC AUC scores for Double Validation

	telecom	adult	employee	credit	mortgages	promotion	kick	kdd_upselling	taxi	poverty_A	poverty_B	poverty_C
CatBoostEncoder	0.8394	0.9293	0.8529	0.7592	0.6967	0.9056	0.7899	0.8633	0.6031	0.7418	0.6902	0.7343
FrequencyEncoder	0.8371	0.9221	0.5563	0.755	0.6582	0.8749	0.7655	0.8551	0.5657	0.6873	0.6037	0.6961
JamesSteinEncoder	0.8398	0.9296	0.8489	0.7598	0.6981	0.905	0.7901	0.8628	0.6033	0.7412	0.6895	0.7366
LeaveOneOutEncoder	0.8393	0.9295	0.8496	0.7595	0.6963	0.9055	0.7902	0.8635	0.602	0.7416	0.6931	0.7345
MEstimateEncoder	0.8405	0.9292	0.8125	0.7597	0.6939	0.9063	0.7881	0.863	0.5984	0.7375	0.6801	0.7204
TargetEncoder	0.8393	0.9294	0.8537	0.7596	0.6954	0.9057	0.7909	0.8643	0.6025	0.7415	0.6903	0.7352
WOEEncoder	0.8401	0.9294	0.824	0.7599	0.6977	0.9041	0.7905	0.8631	0.6011	0.7407	0.6911	0.7345

Results

To determine the best encoder, I min-max scaled the ROC AUC scores of each dataset and then averaged the scaled scores for each encoder. The result is an average performance score per encoder (higher is better). The encoder performance scores for each type of validation are shown in Tables 2.1 to 2.3.
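As a rough illustration of this scoring, the sketch below min-max scales each dataset column and averages per encoder. It uses only a small subset of Table 1.4, so its output will not reproduce Tables 2.1 to 2.3; the function name encoder_performance and the choice of which rows enter the minimum and maximum are assumptions, not the code used in show_results.ipynb.

```python
import pandas as pd

# A small subset of Table 1.4 (rows: encoders, columns: datasets).
scores = pd.DataFrame(
    {
        "telecom": [0.8394, 0.8371, 0.8393],
        "employee": [0.8529, 0.5563, 0.8537],
        "taxi": [0.6031, 0.5657, 0.6025],
    },
    index=["CatBoostEncoder", "FrequencyEncoder", "TargetEncoder"],
)


def encoder_performance(scores):
    """Min-max scale each dataset column to [0, 1], then average per encoder."""
    col_min = scores.min()
    col_max = scores.max()
    scaled = (scores - col_min) / (col_max - col_min)
    # mean(axis=1) skips NaN, so an encoder with missing runs is averaged
    # over the datasets it completed.
    return scaled.mean(axis=1).sort_values(ascending=False)


performance = encoder_performance(scores)
print(performance.round(4))
```

Scaling per dataset keeps easy datasets (where every encoder scores high) from dominating the average; a dataset on which all encoders tie would produce NaN for that column and be skipped.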

To determine the best validation strategy, I compared the top score on each dataset across the types of validation. The score improvements (the top score per dataset and the average score per encoder) are shown in Tables 2.4 and 2.5 below.
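The entries of Table 2.4 are consistent with the difference between the best scores under two validation types, expressed in percentage points (for employee, 0.8498 - 0.8300 = 0.0198, i.e. 1.98). A small sketch of that calculation for two datasets, using the scores from Tables 1.2 to 1.4; the function name top_score_improvement is illustrative:

```python
# Scores for two datasets, copied from Tables 1.2-1.4 (one value per encoder).
scores = {
    "employee": {
        "None": [0.5006, 0.5004, 0.807, 0.83, 0.5003, 0.6233,
                 0.4998, 0.501, 0.8053, 0.5003, 0.5012],
        "Single": [0.7569, 0.8498, 0.8138, 0.8344, 0.7817, 0.6121,
                   0.7353, 0.8274, 0.8053, 0.815, 0.8325],
        "Double": [0.8529, 0.5563, 0.8489, 0.8496, 0.8125, 0.8537, 0.824],
    },
    "taxi": {
        "None": [0.5477, 0.5656, 0.5432, 0.5, 0.5091, 0.5498, 0.5428, 0.5485],
        "Single": [0.5844, 0.582, 0.5898, 0.5, 0.5946, 0.6034, 0.5894, 0.5903],
        "Double": [0.6031, 0.5657, 0.6033, 0.602, 0.5984, 0.6025, 0.6011],
    },
}


def top_score_improvement(by_validation, before, after):
    """Best score under `after` minus best score under `before`, in percentage points."""
    diff = max(by_validation[after]) - max(by_validation[before])
    return round(diff * 100, 2)


for name, by_validation in scores.items():
    print(
        name,
        top_score_improvement(by_validation, "None", "Single"),
        top_score_improvement(by_validation, "Single", "Double"),
    )
# employee 1.98 0.39
# taxi 3.78 -0.01
```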

Table 2.1 Encoders performance scores - None Validation

	None Validation
HelmertEncoder	0.9517
SumEncoder	0.9434
FrequencyEncoder	0.9176
CatBoostEncoder	0.5728
TargetEncoder	0.5174
JamesSteinEncoder	0.5162
OrdinalEncoder	0.4964
WOEEncoder	0.4905
MEstimateEncoder	0.4501
BackwardDifferenceEncoder	0.4128
LeaveOneOutEncoder	0.0697

Table 2.2 Encoders performance scores - Single Validation

	Single Validation
CatBoostEncoder	0.9726
OrdinalEncoder	0.9694
HelmertEncoder	0.9558
SumEncoder	0.9434
WOEEncoder	0.9326
FrequencyEncoder	0.9315
BackwardDifferenceEncoder	0.9108
TargetEncoder	0.8915
JamesSteinEncoder	0.8555
MEstimateEncoder	0.8189
LeaveOneOutEncoder	0.0729

Table 2.3 Encoders performance scores - Double Validation

	Double Validation
JamesSteinEncoder	0.9918
CatBoostEncoder	0.9917
TargetEncoder	0.9916
LeaveOneOutEncoder	0.9909
WOEEncoder	0.9838
MEstimateEncoder	0.9686
FrequencyEncoder	0.8018

Table 2.4 Top score improvement (percent)

	None -> Single	Single -> Double
telecom	0.00	0.01
adult	0.02	-0.03
employee	1.98	0.39
credit	-0.01	-0.00
mortgages	0.26	-0.47
promotion	0.04	-0.20
kick	-0.05	0.06
kdd_upselling	0.10	-0.11
taxi	3.78	-0.01
poverty_A	0.74	-0.11
poverty_B	5.59	0.29
poverty_C	0.48	-0.54

Table 2.5 Encoders performance scores improvement (percent)