scikit-learn-contrib/category_encoders

LeaveOneOutEncoder returns wrong output (multiplied class levels) *critical*

agilebean opened this issue · 4 comments

This is IMHO a critical error, as it would falsify all research findings conducted with the LeaveOneOutEncoder.

Expected Behavior

Like any other encoder in the category_encoders library, the LeaveOneOutEncoder should return a single encoded value per class level.
Example:
In the current dataset, the feature job_type contains 4 class levels.
The MEstimate encoder encodes this feature into 4 values (a sketch of this expected behaviour follows the table), namely:

level  encoded value          count
    1  58.696997132941            5
    2  62.1102766263165        4308
    3  64.4396724145027          54
    4  66.1381150538009          51
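
A minimal sketch of this expected behaviour (synthetic data and an assumed column name, not the original dataset): the MEstimateEncoder maps every class level to exactly one value.

```python
import pandas as pd
import category_encoders as ce

# Synthetic stand-in for the data in this issue (job_type is an assumed column name)
train = pd.DataFrame({"job_type": ["a"] * 3 + ["b"] * 4 + ["c"] * 2 + ["d"] * 2})
y = pd.Series([36, 88, 78, 99, 57, 33, 70, 62, 36, 78, 55], dtype=float)

encoded = ce.MEstimateEncoder(cols=["job_type"]).fit_transform(train, y)["job_type"]
print(encoded.nunique())  # 4: one encoded value per class level
```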

density plot: [image]

plot of target vs. feature (encoded by MEstimate): [image]

Actual Behavior

The LeaveOneOut encoder converts the same feature job_type into 94 distinct values, starting with:

[1] 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
  [12] 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
...
 [111] 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
 [122] 64.43967 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028 62.11028
 [133] 64.43967 62.11028 62.11028 62.11028 62.11028 66.13812 62.11028 62.11028 62.11028 62.11028 62.11028

density plot: [image]

The density plot reveals that the encoding is NOT concentrated on 4 distinct values but dispersed.
This is even more evident in the following plot.

plot of target vs. feature (encoded by LeaveOneOut): [image]

The previous plot clearly shows the problem:
The LeaveOneOut encoder converts the class levels into a distribution and then performs the encoding on this distribution. This results in the target being encoded across the whole range of the distribution, effectively encoding the target through a linear relationship.

Proof:
When extracting the numeric difference between the LeaveOneOut and MEstimate encodings (a sketch of this computation follows below),

   jobtype.loo jobtype.mestimate groundtruth    delta12
 1        62.1              62.1          36  0.00606  
 2        62.1              62.1          88 -0.00601  
 3        62.1              62.1          78 -0.00369  
 4        62.1              62.1          99 -0.00857  
 5        62.1              62.1          57  0.00119  
 6        62.1              62.1          33  0.00676  
 7        62.1              62.1          70 -0.00183  
 8        62.1              62.1          62  0.0000256
 9        62.1              62.1          36  0.00606  
10        62.1              62.1          78 -0.00369

this difference delta12 shows an almost exactly linear relationship with the target, centered around 0:

[image: plot of delta12 vs. target]
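
A minimal sketch of how such a delta12 column could be computed (synthetic data and assumed column names; not the original code from this issue):

```python
import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"job_type": ["a"] * 3 + ["b"] * 4 + ["c"] * 2 + ["d"] * 2})
y = pd.Series([36, 88, 78, 99, 57, 33, 70, 62, 36, 78, 55], dtype=float)

loo_vals = ce.LeaveOneOutEncoder(cols=["job_type"]).fit_transform(train, y)["job_type"]
mest_vals = ce.MEstimateEncoder(cols=["job_type"]).fit_transform(train, y)["job_type"]

comparison = pd.DataFrame({
    "jobtype.loo": loo_vals,
    "jobtype.mestimate": mest_vals,
    "groundtruth": y,
    "delta12": loo_vals - mest_vals,  # row-wise difference between the encodings
})
print(comparison.head(10))
```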

Steps to Reproduce the Problem

The same call, encoder.fit_transform(), was used to perform the category encoding for LeaveOneOut and all other encoders, using the training set and the training-set target (see the sketch below).
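
A hedged sketch of this step (the encoder list, synthetic data, and column names are assumptions, not the exact benchmark code):

```python
import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"job_type": ["a"] * 3 + ["b"] * 4 + ["c"] * 2 + ["d"] * 2})
y = pd.Series([36, 88, 78, 99, 57, 33, 70, 62, 36, 78, 55], dtype=float)

encoders = {
    "LeaveOneOut": ce.LeaveOneOutEncoder(cols=["job_type"]),
    "MEstimate": ce.MEstimateEncoder(cols=["job_type"]),
    "Target": ce.TargetEncoder(cols=["job_type"]),
}
for name, encoder in encoders.items():
    encoded = encoder.fit_transform(train, y)["job_type"]
    # Only LeaveOneOut yields more distinct values than there are class levels
    print(f"{name}: {encoded.nunique()} distinct values")
```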

The problem only occurred with the LeaveOneOut encoder, as shown in this plot:

[image: comparison across encoders]

Specifications

  • Version:
    category_encoders 2.2.2

  • Platform:
macOS Big Sur 11.2

  • Subsystem:

Hi @agilebean
This is actually expected behaviour.
On the training set, the leave-one-out encoder does not use the current row when calculating the encoding value. This is done to avoid overfitting on values of categorical attributes with low cardinality. So for regression problems the encoded values are expected to differ greatly. On other data sets (validation, test, new unseen data) the learned mapping is applied, and you get only one value per category.
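
A minimal sketch of this behaviour (made-up data, assumed column name): on the training set each row's own target is excluded, so a single category yields many values; on unseen data the one learned per-category mean is applied.

```python
import pandas as pd
import category_encoders as ce

X_train = pd.DataFrame({"job_type": ["a", "a", "a", "b", "b", "b"]})
y_train = pd.Series([10.0, 20.0, 30.0, 1.0, 2.0, 3.0])

enc = ce.LeaveOneOutEncoder(cols=["job_type"])
# Training: each row is encoded with the mean of the other rows of its category
print(enc.fit_transform(X_train, y_train)["job_type"].tolist())
# a -> [25.0, 20.0, 15.0], b -> [2.5, 2.0, 1.5]

# New data: the single learned per-category mean is applied
X_new = pd.DataFrame({"job_type": ["a", "b"]})
print(enc.transform(X_new)["job_type"].tolist())  # [20.0, 2.0]
```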

Hi @PaulWestenthanner
Thanks for replying.
However, I don't understand your answer. Could you elaborate?

> Hi @agilebean This is actually expected behaviour.
Do you mean the linear relationship I showed above?
But that would mean the LOO encoder encodes the target linearly, wouldn't it?

> On the training set, the leave-one-out encoder does not use the current row when calculating the encoding value. This is done to avoid overfitting on values of categorical attributes with low cardinality.

Yes, I understand this principle of the LOO encoder. But the problem I raised here shows that a linear model of the target is encoded. And this explains the far-too-good performance in the benchmark, clearly because the target itself was encoded.

Thanks again for replying, but I think the issue I raised is still valid.
Please compare the LOO encoder's performance with the other encoders, some of which are also from scikit-learn. It is "too" good; no algorithm can outperform all others by such a margin. Nothing comparable is shown in any published ML research, to the best of my knowledge.

In conclusion, please re-open the issue.

The linear dependency is just because the average of a set of values changes linearly with a single value. So the value of the encoding should be in a linear relationship to the value left out, right? That's also by design and nothing special. I do agree that this can be a problem with regard to overfitting.
I don't know about your specific data set, but I could imagine that your too-good-to-be-true values are due to this relationship AND the fact that one category just outweighs the others. You should get low RMSE values if you're perfect for the 4308 data points with the most frequent label and rather poor ones otherwise. Can you elaborate on whether this is the case?
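
A short sketch of this point, assuming the standard leave-one-out formula: for row i in a category with n rows and target sum S, the encoding is loo_i = (S - y_i) / (n - 1), which is affine in y_i with slope -1/(n - 1).

```python
import numpy as np

y = np.array([36.0, 88.0, 78.0, 99.0, 57.0])  # made-up targets of one category
S, n = y.sum(), len(y)
loo = (S - y) / (n - 1)  # leave-one-out encoding of each row

slope = np.polyfit(y, loo, 1)[0]  # fitted slope of loo vs. y
print(slope, -1.0 / (n - 1))      # both -0.25 for n = 5: exactly linear
```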

Dear @PaulWestenthanner
Your answer is highly appreciated.
I will check your points about the possibility of a severe class imbalance problem and get back to you within a week. Thanks again for this hint.