Possible SummaryEncoder doc error
glevv opened this issue · 7 comments
Expected Behavior
SummaryEncoder should return N*cat_features columns, where N is the number of quantiles used to describe each category. At least, this is what section 2.1 of the original paper states:
A generalization of the quantile encoder is to compute several features corresponding to different quantiles per each categorical feature, instead of a single feature
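To make the quoted idea concrete, here is a minimal sketch of computing one feature per quantile for each categorical column with plain pandas. It is not the category_encoders implementation: the function name `summary_encode` and the toy data are made up for illustration, and the sketch ignores any smoothing or unknown-category handling the library may apply.

```python
import pandas as pd


def summary_encode(X, y, cols, quantiles):
    """Hypothetical helper: append one target-quantile column per (quantile, column) pair."""
    out = X.copy()
    target = pd.Series(list(y), index=X.index)
    for q in quantiles:
        for col in cols:
            # Quantile of the target within each category of `col`.
            per_category = target.groupby(X[col]).quantile(q)
            # Map every row's category to that category's quantile value.
            out[f"{col}_{int(round(q * 100))}"] = X[col].map(per_category)
    return out


# Toy data (assumed for illustration only).
X = pd.DataFrame({"color": ["r", "r", "r", "b", "b"], "size": [1, 2, 3, 4, 5]})
y = [1.0, 2.0, 3.0, 10.0, 20.0]

encoded = summary_encode(X, y, cols=["color"], quantiles=[0.25, 0.5, 0.75])
print(list(encoded.columns))
```

With one encoded column and three quantiles, the sketch adds three new columns, which is the N*cat_features behaviour described above.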
Actual Behavior
The docs example states that SummaryEncoder returns 1*cat_features columns:
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SummaryEncoder(cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75]).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
whereas it should be something like this:
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SummaryEncoder(cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75]).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 17 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS_25 506 non-null float64
CHAS_50 506 non-null float64
CHAS_75 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD_25 506 non-null float64
RAD_50 506 non-null float64
RAD_75 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(17)
memory usage: 51.5 KB
None
You're right. @cmougan this was probably a copy-paste error?
Yes, it's a copy-paste issue.
It returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 CHAS_25 506 non-null float64
14 RAD_25 506 non-null float64
15 CHAS_50 506 non-null float64
16 RAD_50 506 non-null float64
17 CHAS_75 506 non-null float64
18 RAD_75 506 non-null float64
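The column count in this output can be checked by hand: the 13 original columns are kept, and 2 encoded columns times 3 quantiles adds 6 more. A small sketch, where `expected_columns` is a hypothetical helper (not part of category_encoders) that mirrors the naming and ordering seen in the output above:

```python
def expected_columns(original_cols, encoded_cols, quantiles):
    # Hypothetical helper: original columns first, then one "<col>_<percent>"
    # column per (quantile, encoded column) pair, quantile-major as in the output above.
    new_cols = [f"{c}_{int(round(q * 100))}" for q in quantiles for c in encoded_cols]
    return list(original_cols) + new_cols


boston_cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
               "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
cols = expected_columns(boston_cols, ["CHAS", "RAD"], [0.25, 0.5, 0.75])
print(len(cols))   # 19
print(cols[13:])   # ['CHAS_25', 'RAD_25', 'CHAS_50', 'RAD_50', 'CHAS_75', 'RAD_75']
```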
Currently you can't use Summary Encoder or Quantile Encoder from category_encoders because they have not been released yet.
Until a new version of the category_encoders package is published, you can use the implementation that we used in the original paper, available via pip install sktools
@PaulWestenthanner maybe we could do a package release?
We definitely should release. Unfortunately I do not have the rights to do so...
@PaulWestenthanner who does?
@PaulWestenthanner you should have rights. If you update the version in __init__.py and the changelog, then go to the releases page on GitHub and draft a new release (tagged with the release number), the GitHub action should take care of the rest.
Ah, I didn't know that. Sorry that I postponed the release for so long. It worked like a charm though. The new version is visible on PyPI. Thanks a lot @wdm0006 !