LeaveOneOutEncoder returns wrong output (multiplied class levels) *critical*
agilebean opened this issue · 4 comments
This is IMHO a critical error, as it would falsify all research findings conducted with the LeaveOneOutEncoder.
Expected Behavior
Like any other encoder in the category_encoders library, the LeaveOneOutEncoder should return a single encoded value per class level.
Example:
In the current dataset, the feature job_type
contains 4 class levels.
This feature is encoded by the MEstimateEncoder
into 4 values, namely:
  level  encoded value      count
  1      58.696997132941    5
  2      62.1102766263165   4308
  3      64.4396724145027   54
  4      66.1381150538009   51
plot target vs. feature (encoded by Mestimate):
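For intuition, here is a minimal sketch of an m-estimate-style target encoding on hypothetical toy data (the `cats`/`target` arrays and the smoothing strength `m` are made up for illustration, not taken from the dataset above): every row of a given level receives the identical value.

```python
from statistics import mean

# Hypothetical toy data standing in for job_type and the target.
cats = ["a", "a", "b", "b", "b", "c"]
target = [10.0, 12.0, 20.0, 22.0, 24.0, 30.0]

m = 1.0  # smoothing strength of the m-estimate
global_mean = mean(target)

# m-estimate encoding: one smoothed target mean per category level
encoding = {}
for level in set(cats):
    ys = [y for c, y in zip(cats, target) if c == level]
    encoding[level] = (sum(ys) + m * global_mean) / (len(ys) + m)

encoded = [encoding[c] for c in cats]
# every row of the same level gets the identical value
```

With 3 levels, `encoding` holds exactly 3 distinct values, matching the expectation of one value per class level.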
Actual Behavior
The LeaveOneOutEncoder
converts the same feature job_type
into 94 values, starting with:
[1] 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
[12] 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
...
[111] 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
[122] 64.43967 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
[133] 64.43967 62.11028 62.11028 62.11028 62.11028 66.13812 62.11028 62.11028 62.11028 62.11028 62.11028
The density plot reveals that the encoding is NOT concentrated on 4 distinct values but dispersed.
This is even more evident in the following plot.
plot target vs. feature (encoded by LeaveOneOut):
The previous plot clearly shows the problem:
The LeaveOneOut encoder converts the class levels into a distribution and then performs the encoding on this distribution. This results in the target being encoded across the whole range of the distribution, effectively encoding the target through a linear relationship.
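The per-row behaviour can be reproduced with a minimal leave-one-out sketch on hypothetical toy data (the arrays below are invented for illustration): each training row is encoded with the target mean of the *other* rows of its level, so rows sharing a level get different values.

```python
from statistics import mean

# Hypothetical toy data: three rows of level "a", two of level "b".
cats = ["a", "a", "a", "b", "b"]
target = [10.0, 20.0, 30.0, 5.0, 15.0]

loo = []
for i, (c, y) in enumerate(zip(cats, target)):
    # mean of the target over all OTHER rows with the same level
    others = [t for j, (d, t) in enumerate(zip(cats, target))
              if j != i and d == c]
    loo.append(mean(others) if others else mean(target))

# rows sharing a level receive *different* values because
# each row leaves its own target out of the mean
```

This is why the training-set output above shows many more values than there are class levels.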
Proof:
When extracting the numeric difference between the LeaveOneOut and MEstimate predictions:
jobtype.loo jobtype.mestimate groundtruth delta12
1 62.1 62.1 36 0.00606
2 62.1 62.1 88 -0.00601
3 62.1 62.1 78 -0.00369
4 62.1 62.1 99 -0.00857
5 62.1 62.1 57 0.00119
6 62.1 62.1 33 0.00676
7 62.1 62.1 70 -0.00183
8 62.1 62.1 62 0.0000256
9 62.1 62.1 36 0.00606
10 62.1 62.1 78 -0.00369
this difference delta12
shows an almost exactly linear relationship with the target, centered around 0:
Steps to Reproduce the Problem
The same code encoder.fit_transform()
was used to perform the category encoding for LeaveOneOut and all other encoders, using the training set and the training-set target.
The problem only occurred with the LeaveOneOut encoder, as shown in this plot:
Specifications
- Version: category_encoders 2.2.2
- Platform: macOS Big Sur 11.2
- Subsystem:
Hi @agilebean
This is actually expected behaviour.
On the training set, the leave-one-out encoder does not use the current row when calculating the encoding value. This is in order to avoid overfitting for values of categorical attributes with low cardinality. So for regression problems the encoded values within one category can differ greatly on the training set. On other data sets (validation, test, new unseen data) the learned mapping is applied, and you get only one value per category.
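A minimal sketch of the fit/transform split described here, on hypothetical toy data: at fit time a plain per-level target mean is learned, and at transform time on new data each level maps to exactly one value.

```python
from statistics import mean

# Hypothetical training data.
train_cats = ["a", "a", "a", "b", "b"]
train_y = [10.0, 20.0, 30.0, 5.0, 15.0]

# fit: learn the plain per-level mean (used for unseen data)
mapping = {c: mean([y for d, y in zip(train_cats, train_y) if d == c])
           for c in set(train_cats)}

# transform on new data: one value per level, as expected
test_cats = ["a", "b", "a"]
test_encoded = [mapping[c] for c in test_cats]
```

So the many distinct values appear only in `fit_transform` on the training set; `transform` on held-out data yields one value per category.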
Hi @PaulWestenthanner
Thanks for replying.
However, I don't understand your answer. Could you elaborate?
> Hi @agilebean This is actually expected behaviour.
Do you mean the linear relationship I showed above?
But that would mean the LOO encoder encodes the target linearly, wouldn't it?
> On the training set the leave one out encoder does not use the current row for calculating the encoding value. This is in order to avoid overfitting for values of categorical attributes with low cardinality.
Yes, I understand this principle of the LOO encoder. But the problem I raised here shows that a linear model of the target is encoded. And this explains the far-too-good performance in the benchmark, clearly because the target was encoded.
Thanks again for replying, but I think the issue I raised is still valid.
Please compare the LOO encoder's performance with the other encoders, some of which are also from scikit-learn. It is "too" good; no algorithm can outperform all others by such a magnitude. Nothing comparable is shown in any published ML research, to the best of my knowledge.
In conclusion, please re-open the issue.
The linear dependency is just because the average of a set of values changes linearly with a single value. So the value of the encoding should be in a linear relationship to the value left out, right? That's also by design and nothing special. I do agree that this can be a problem with regard to overfitting.
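This linearity can be checked directly: if S is the sum of the n target values within one level and mu = S/n is their mean, the leave-one-out value for row i is (S - y_i)/(n - 1), so mu - loo_i = (y_i - mu)/(n - 1), i.e. the delta is exactly linear in the left-out target. A small sketch with hypothetical numbers:

```python
from statistics import mean

# hypothetical target values within a single category level
ys = [36.0, 88.0, 78.0, 99.0, 57.0]
n = len(ys)
S = sum(ys)
mu = mean(ys)

deltas = []
for y_i in ys:
    loo_i = (S - y_i) / (n - 1)   # leave-one-out mean for this row
    deltas.append(mu - loo_i)     # difference from the full mean

# algebraically, each delta equals (y_i - mu) / (n - 1): linear in y_i
assert all(abs(d - (y - mu) / (n - 1)) < 1e-9
           for d, y in zip(deltas, ys))
```

Note the slope is 1/(n - 1), which also explains why the deltas shrink toward 0 for the large category (n = 4308) in the table above.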
I don't know about your specific data set, but I could imagine that your too-good-to-be-true values are due to this relationship AND the fact that one category simply outweighs the others. You should get low RMSE values if you're perfect for the 4308 data points with the most frequent label, and rather bad values otherwise. Can you elaborate on whether this is the case?
Dear @PaulWestenthanner
your answer is highly appreciated.
So I will check your points about the possibility of a severe class imbalance problem, and will get back to you within a week. Thanks again for this hint.