scikit-learn-contrib/category_encoders

CatBoostEncoder mapping back the categories

mirix opened this issue · 1 comments

mirix commented

Expected Behavior

I have not found a function to map the encoded values back to the categorical values when using category_encoders' CatBoostEncoder.

I was trying to do it manually using the following equation: (TargetSum + Prior) / (FeatureCount + 1)

I am using the following example:

https://www.geeksforgeeks.org/categorical-encoding-with-catboost-encoder/

And the following code:

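import category_encoders as ce

# fit() returns the fitted encoder itself
cbe_encoder = ce.cat_boost.CatBoostEncoder()
mapp = cbe_encoder.fit(train, target)

# Prior: global mean of the target
prior = target['grade'].sum() / len(train)
# Per-category 'sum' and 'count' stored by the fitted encoder
color = mapp.mapping.get('color').reset_index()
color['encoder'] = ( color['sum'] + prior ) / ( color['count'] + 1 )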

Same for the column 'interests'.

   color    interests  height  grade
0    red    sketching      68      1
1   blue     painting      64      2
2   blue  instruments      87      3
3  green    sketching      45      2
4    red     painting      54      3
5    red  video games      64      1
6  black     painting      67      4
7  black  instruments      98      4
8   blue    sketching      90      2
9  green    sketching      87      3

   color  interests  height
0  1.875   2.100000      68
1  2.375   2.875000      64
2  2.375   3.166667      87
3  2.500   2.100000      45
4  1.875   2.875000      54
5  1.875   2.500000      64
6  3.500   2.875000      67
7  3.500   3.166667      98
8  2.375   2.100000      90
9  2.500   2.100000      87

2.5

   index  sum  count  encoder
0  black    8      2    3.500
1   blue    7      3    2.375
2  green    5      2    2.500
3    red    5      3    1.875

         index  sum  count   encoder
0  instruments    7      2  3.166667
1     painting    9      3  2.875000
2    sketching    8      4  2.100000
3  video games    1      1  1.750000
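
For illustration, here is a minimal sketch of the reverse lookup I have in mind. It rebuilds the per-category sum and count with plain pandas instead of reading them from the fitted encoder, applies the equation above with a constant of 1, and then matches encoded values back to categories. The helpers `level_encodings` and `decode` are hypothetical names, not part of category_encoders, and the lookup only works when no two categories share the same encoded value.

```python
import pandas as pd

# Toy data copied from the example above
train = pd.DataFrame({
    "color": ["red", "blue", "blue", "green", "red",
              "red", "black", "black", "blue", "green"],
})
grade = pd.Series([1, 2, 3, 2, 3, 1, 4, 4, 2, 3], name="grade")


def level_encodings(column, y, a=1.0):
    """Hypothetical helper: (sum + prior) / (count + a) per category."""
    prior = y.mean()
    stats = y.groupby(column).agg(["sum", "count"])
    return (stats["sum"] + prior) / (stats["count"] + a)


def decode(encoded, encodings, tol=1e-9):
    """Hypothetical helper: map encoded floats back to category labels."""
    def lookup(value):
        mask = ((encodings - value).abs() < tol).to_numpy()
        matches = encodings.index[mask]
        # Zero or several matches means the value cannot be decoded
        return matches[0] if len(matches) == 1 else None
    return encoded.map(lookup)


color_enc = level_encodings(train["color"], grade)
encoded_color = train["color"].map(color_enc)
decoded_color = decode(encoded_color, color_enc)
```

With this toy data the four colors get distinct values, so `decoded_color` recovers the original column. If two categories had the same sum and count, they would be indistinguishable and `decode` would return None for both.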

Actual Behavior

As you can see, all values match except for video games: the encoder assigns it 2.5 (the prior), but applying the equation yields 1.75, which seems to be the correct value to me.

Or is the constant different from 1 when there is only one occurrence?

Steps to Reproduce the Problem

  1. Add the code above to the code from the link.
  2. Run it.

Specifications

  • Version: 2.5.0
  • Platform: Linux zboox 5.18.3-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Jun 9 09:54:55 UTC 2022 x86_64 GNU/Linux
  • Subsystem:

> Or is the constant different from 1 when there is only one occurrence?

This is actually the case in the current implementation. It is done to avoid over-fitting. However, there is a discussion about whether we should drop this behaviour and instead handle small sample sizes via regularization. The target encoder, for example, behaves the same way. We discuss this in issue #327.

For reference this is the critical line in the current code: https://github.com/scikit-learn-contrib/category_encoders/blob/master/category_encoders/cat_boost.py#L120=
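
As a rough illustration of the behaviour described above, the sketch below recomputes the 'interests' encoding with plain pandas and replaces the value of any category that occurs only once with the prior. It is not the library's code: the function name `encode_levels` is made up for this example, and the constant `a` is assumed to be 1, as in the equation from the question.

```python
import pandas as pd

# 'interests' column and target from the example in the question
interests = pd.Series(
    ["sketching", "painting", "instruments", "sketching", "painting",
     "video games", "painting", "instruments", "sketching", "sketching"],
    name="interests",
)
grade = pd.Series([1, 2, 3, 2, 3, 1, 4, 4, 2, 3], name="grade")


def encode_levels(column, y, a=1.0):
    """Hypothetical helper mimicking the single-occurrence fallback."""
    prior = y.mean()
    stats = y.groupby(column).agg(["sum", "count"])
    smoothed = (stats["sum"] + prior) / (stats["count"] + a)
    # Keep the smoothed value only where the category occurs more than once;
    # categories seen a single time fall back to the prior.
    return smoothed.where(stats["count"] > 1, prior)


levels = encode_levels(interests, grade)
print(levels)
```

Under these assumptions 'video games' comes out as 2.5 instead of 1.75, while the other three categories keep the values from the manual calculation.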