scikit-learn-contrib/category_encoders

CountEncoder returning categories instead of floats/ints after transform

glevv opened this issue · 2 comments

glevv commented
import numpy as np
import pandas as pd
from category_encoders import CountEncoder

X = pd.DataFrame(
    {
        'some_cat': ['W', 'L', 'W', 'W', 'L'],
        'some_num': np.random.normal(size=5),
    },
    columns=['some_cat', 'some_num'],
)
X['some_cat'] = X['some_cat'].astype('category')

ce = CountEncoder(cols=['some_cat'])
Xt = ce.fit_transform(X)
Xt.info()

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   some_cat  5 non-null      category
 1   some_num  5 non-null      float64 
dtypes: category(1), float64(1)
memory usage: 269.0 bytes

   some_cat  some_num
0         3  0.660804
1         2 -0.150932
2         3 -1.044160
3         3  0.115020
4         2 -0.035625

This breaks sklearn pipelines, since downstream estimators cannot work with the 'category' dtype.
Tested on my laptop (category_encoders 2.2.2, pandas 1.2.4) and on Colab (category_encoders 2.2.2, pandas 1.1.5).
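
For anyone stuck on an affected version, one possible workaround is to cast the encoded columns back to a numeric dtype after the transform. The snippet below is only a sketch: `categories_to_numeric` is a hypothetical helper, not part of category_encoders, and the frame is built by hand to mimic the output shown above, so it runs with pandas alone.

```python
import pandas as pd


def categories_to_numeric(df, cols):
    """Return a copy of df in which the given category-dtype columns are numeric.

    Hypothetical helper (not part of category_encoders). It assumes the
    categories themselves are numbers, as in the encoded output above.
    """
    out = df.copy()
    for col in cols:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # Drop the categorical wrapper, then let pandas infer int/float.
            out[col] = pd.to_numeric(out[col].astype(object))
    return out


# Hand-built stand-in for the encoder output: counts stored as 'category'.
Xt = pd.DataFrame({
    'some_cat': pd.Series([3, 2, 3, 3, 2]).astype('category'),
    'some_num': [0.66, -0.15, -1.04, 0.12, -0.04],
})

Xt_fixed = categories_to_numeric(Xt, ['some_cat'])
print(Xt_fixed.dtypes)
```

In a real pipeline the same cast would be applied to the result of `ce.fit_transform(X)` before passing it to an estimator.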

This should be fixed by #336, which will be included in the next release.

On pandas 1.4.0 I get:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   some_cat  5 non-null      int64  
 1   some_num  5 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 208.0 bytes

Fixed in version 2.4.0. Closing the issue.