scikit-learn-contrib/category_encoders

Intercept in Contrast Coding Schemes

PaulWestenthanner opened this issue · 7 comments

Expected Behavior

The constant (all-ones) intercept column should not be added when applying contrast coding schemes (i.e. backward difference, sum, polynomial, and Helmert coding).

I don't think this intercept column is needed. If you fit a supervised learning model, it will probably help to remove the intercept column. I suspect it is there because statsmodels requires you to add the intercept explicitly when fitting linear models.
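
For context, a minimal statsmodels sketch of why an explicit constant is needed there (standard statsmodels usage, independent of category_encoders):

        import numpy as np
        import statsmodels.api as sm

        # statsmodels OLS fits without an intercept unless one is added explicitly.
        X = np.array([[0.0], [1.0], [2.0], [3.0]])
        y = np.array([1.0, 3.0, 5.0, 7.0])  # y = 1 + 2x

        model = sm.OLS(y, sm.add_constant(X)).fit()  # add_constant prepends a column of ones
        print(model.params)  # approximately [1.0, 2.0] -> [intercept, slope]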
However, I don't like that the output of an encoder would then depend on whether an intercept column is already present. For example, if I first apply encoder A to column A and then encoder B to column B, B's intercept column overwrites A's, so no new column is added (see the sketch below). Likewise, if I happened to have a (non-constant) column called intercept, it would get overwritten.
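
A minimal sketch of the collision described above, assuming the encoders name the constant column intercept and overwrite an existing one as reported:

        import pandas as pd
        import category_encoders as encoders

        df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['x', 'y', 'z']})

        # Encoding column A adds a constant 'intercept' column.
        step1 = encoders.BackwardDifferenceEncoder(cols=['A']).fit_transform(df)

        # Encoding column B afterwards reuses the same name, so the second
        # encoder overwrites the first one's column instead of adding a new one.
        step2 = encoders.SumEncoder(cols=['B']).fit_transform(step1)
        print(step2.columns.tolist())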

Any opinion? Am I missing something? Is the intercept necessary?

Actual Behavior

A constant column with all values equal to 1 is added.

Steps to Reproduce the Problem

Run transform on any fitted contrast coding encoder, e.g.

        import category_encoders as encoders

        train = ['A', 'B', 'C']
        encoder = encoders.BackwardDifferenceEncoder(handle_unknown='value', handle_missing='value')
        print(encoder.fit_transform(train))  # output includes a constant 'intercept' column
glevv commented

Can the intercept be added as a class parameter?
If so, that is the way to go, IMO. These classes could then be tested with different intercept settings to catch errors and bugs.

PaulWestenthanner commented

Yes, I think we could; that should be rather straightforward. Would you set with_intercept=True as the default for backwards compatibility, or not (which might be more correct)?

glevv commented

Yes, I think we could; that should be rather straightforward. Would you set with_intercept=True as the default for backwards compatibility, or not (which might be more correct)?

Yep, I would set it to True to keep the default behavior intact.
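
For illustration, a minimal sketch of how the proposed opt-out could work; with_intercept is the parameter name floated above and does not exist in the released library:

        import pandas as pd

        def _maybe_add_intercept(X: pd.DataFrame, with_intercept: bool = True) -> pd.DataFrame:
            # Sketch: gate the constant column on the proposed parameter,
            # defaulting to True to keep the current output unchanged.
            if with_intercept:
                X = X.copy()
                X.insert(0, 'intercept', pd.Series(1, index=X.index))
            return X

        contrast_cols = pd.DataFrame({'col_0': [-0.5, 0.5, 0.0]})
        print(_maybe_add_intercept(contrast_cols))                          # gains 'intercept'
        print(_maybe_add_intercept(contrast_cols, with_intercept=False))    # unchanged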