scikit-learn-contrib/category_encoders

Possible SummaryEncoder doc error

glevv opened this issue · 7 comments

glevv commented

Expected Behavior

SummaryEncoder should return N*cat_features columns, where N is the number of quantiles used to describe each category. At least, this is what the original paper states in section 2.1:

A generalization of the quantile encoder is to compute several features corresponding to different quantiles per each categorical feature, instead of a single feature
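To make the expected column count concrete, here is a minimal pandas sketch of the idea in that quote. It is not the category_encoders implementation (it skips any smoothing or handling of unseen categories); `summary_encode_sketch` and the toy data are hypothetical names and values used only for illustration.

```python
import pandas as pd


def summary_encode_sketch(X, y, cols, quantiles):
    """Return one feature per (categorical column, quantile) pair.

    Hypothetical helper for illustration only, not the library's API.
    """
    target = pd.Series(list(y), index=X.index)
    features = pd.DataFrame(index=X.index)
    for q in quantiles:
        for col in cols:
            # Quantile q of the target within each category of `col`.
            per_category = target.groupby(X[col]).quantile(q)
            # Map each row's category to that category's quantile.
            features[f"{col}_{int(round(q * 100))}"] = X[col].map(per_category)
    return features


# Toy data (made up): two categorical columns and a numeric target.
X = pd.DataFrame({"CHAS": [0, 0, 1, 1, 1], "RAD": [1, 2, 1, 2, 2]})
y = [10.0, 20.0, 30.0, 40.0, 50.0]

encoded = summary_encode_sketch(X, y, cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75])
print(encoded.shape[1])  # 3 quantiles * 2 categorical columns = 6
```

With N = 3 quantiles and 2 categorical features, the sketch yields 6 encoded columns, which is the N*cat_features count described above.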

Actual Behavior

The docs example states that SummaryEncoder returns 1*cat_features columns:

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SummaryEncoder(cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75]).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

whereas it should be something like this:

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SummaryEncoder(cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75]).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 17 columns):
CRIM          506 non-null float64
ZN            506 non-null float64
INDUS         506 non-null float64
CHAS_25       506 non-null float64
CHAS_50       506 non-null float64
CHAS_75       506 non-null float64
NOX           506 non-null float64
RM            506 non-null float64
AGE           506 non-null float64
DIS           506 non-null float64
RAD_25        506 non-null float64
RAD_50        506 non-null float64
RAD_75        506 non-null float64
TAX           506 non-null float64
PTRATIO       506 non-null float64
B             506 non-null float64
LSTAT         506 non-null float64
dtypes: float64(17)
memory usage: 51.5 KB
None

You're right. @cmougan this was probably a copy-paste error?

Yes, it's a copy-paste issue.

It returns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  CHAS_25  506 non-null    float64
 14  RAD_25   506 non-null    float64
 15  CHAS_50  506 non-null    float64
 16  RAD_50   506 non-null    float64
 17  CHAS_75  506 non-null    float64
 18  RAD_75   506 non-null    float64
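For anyone updating the docs example, the column layout in the output above can be reproduced with a small helper. This is only a sketch: the `<col>_<percent>` naming and the ordering (original columns first, then new columns grouped by quantile) are inferred from the output shown here, not from a documented guarantee, and `expected_summary_columns` is a hypothetical name.

```python
def expected_summary_columns(original_columns, cols, quantiles):
    """Build the column list seen in the output above.

    Assumes (from the printed output only) that original columns are kept
    and that new columns are appended quantile by quantile.
    """
    new_columns = [
        f"{col}_{int(round(q * 100))}" for q in quantiles for col in cols
    ]
    return list(original_columns) + new_columns


boston_columns = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]
expected = expected_summary_columns(
    boston_columns, cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75]
)
print(len(expected))  # 19
print(expected[13:])
```

That gives 13 original columns plus 3 * 2 = 6 encoded ones, matching the 19 columns reported.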

Currently you can't use SummaryEncoder or QuantileEncoder because they are not yet released.
Until a new version of the category_encoders package is published, you can use the implementation we used in the original paper, available via pip install sktools

@PaulWestenthanner maybe we could do a package release?

We definitely should release. Unfortunately I do not have the rights to do so...

@PaulWestenthanner you should have rights. If you update the version in __init__.py and the changelog, then go to the releases page on GitHub and draft a new release (tagged with the release number), the GitHub Action should take care of the rest.

Ah, I didn't know that. Sorry that I postponed the release for so long. It worked like a charm though. The new version is visible on PyPI. Thanks a lot @wdm0006 !