scikit-learn-contrib/category_encoders

category encoder not "fitting" on categorical (pandas) columns

jyk4100 opened this issue · 1 comments

Reference: issue report on sklearn

Setup

Load the packages:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import category_encoders

Load the house_prices dataset from OpenML as an example, then set the features and target:

housing = fetch_openml(name="house_prices", as_frame=True)
df = housing.data
nona = df.isna().sum(axis=0)
df = df[nona.index[nona==0]].copy()
df['target'] = (df['SaleCondition'] == "Normal").astype(np.int32)
df['target'].value_counts()

Expected Behavior

cat_cols = ['MSSubClass','OverallQual','OverallCond', 'MoSold', 'GarageCars']
for col in cat_cols:
    df[col] = df[col].astype(np.int64)
df[cat_cols].dtypes ## int64
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes 

le = category_encoders.LeaveOneOutEncoder(cols=cat_cols)
le.get_params()
X_train_tr = le.fit_transform(X=X_train, y=y_train)
X_test_tr = le.transform(X_test)

When we pass a df with integer columns to the encoder, it returns an array/df of floats computed according to the encoder's logic:

X_train_tr.iloc[0:2]
      MSSubClass  OverallQual  OverallCond    MoSold  GarageCars
1216    0.777778      0.87372     0.759939  0.834783    0.844411
339     0.796296      0.87372     0.882716  0.887755    0.884354
X_test_tr.iloc[0:2]
     MSSubClass  OverallQual  OverallCond    MoSold  GarageCars
854    0.796767     0.861635     0.886364  0.830601    0.844646
381    0.796767     0.778210     0.760305  0.742574    0.844646
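
For readers unfamiliar with the floats above, here is a minimal pure-Python sketch of leave-one-out target encoding as the technique is usually described. It is not the category_encoders implementation. The function names are hypothetical, and the fallback to the global mean for singleton or unseen categories is an assumption made for this sketch.

```python
from collections import defaultdict


def leave_one_out_fit_transform(categories, targets):
    """Encode training rows: mean target of the row's category, excluding the row itself."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, tgt in zip(categories, targets):
        sums[cat] += tgt
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, tgt in zip(categories, targets):
        if counts[cat] > 1:
            encoded.append((sums[cat] - tgt) / (counts[cat] - 1))
        else:
            # Assumption: a category seen only once falls back to the global mean.
            encoded.append(global_mean)
    return encoded, dict(sums), dict(counts), global_mean


def leave_one_out_transform(categories, sums, counts, global_mean):
    """Encode unseen rows: plain per-category mean (no target available to leave out)."""
    # Assumption: categories not seen during fitting fall back to the global mean.
    return [sums[c] / counts[c] if c in counts else global_mean for c in categories]


# Toy data: two categories (20 and 60) with a binary target.
train_cats = [20, 20, 60, 60, 60]
train_y = [1, 0, 1, 1, 0]
train_enc, sums, counts, global_mean = leave_one_out_fit_transform(train_cats, train_y)
test_enc = leave_one_out_transform([20, 60, 99], sums, counts, global_mean)
```

In this sketch, training rows of the same category can get different values because each row leaves out its own target, while test rows of the same category always share the category mean.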

Actual Behavior

cat_cols = ['MSSubClass','OverallQual','OverallCond', 'MoSold', 'GarageCars']
for col in cat_cols:
    df[col] = df[col].astype('category')
df[cat_cols].dtypes ## category
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes ## category

le = category_encoders.LeaveOneOutEncoder(cols=cat_cols)
le.get_params()
X_train_tr = le.fit_transform(X=X_train, y=y_train)
X_test_tr = le.transform(X_test)
X_train_tr.iloc[0:2]
X_test_tr.iloc[0:2]

We get a df of floats, but every value is the same: X_train_tr['MSSubClass'].value_counts() shows a single unique value. The encoder is "fitted" in name only, since transform does not produce the expected result either (it returns an array of NaN):

X_train_tr.iloc[0:2]
      MSSubClass  OverallQual  OverallCond    MoSold  GarageCars
1216    0.821918     0.821918     0.821918  0.821918    0.821918
339     0.821918     0.821918     0.821918  0.821918    0.821918
X_test_tr.iloc[0:2]
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])
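
Since the int64 run above behaves as expected, one possible stopgap is to cast category-dtype columns back to the dtype of their categories before handing the frame to the encoder. The helper below is a hypothetical sketch (`decategorize` is not part of pandas or category_encoders). It assumes the columns contain no missing values, and it has not been verified against category_encoders 2.4.0.

```python
import pandas as pd


def decategorize(frame, cols):
    """Return a copy of `frame` with category-dtype columns in `cols` cast to their categories' dtype."""
    out = frame.copy()
    for col in cols:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # e.g. a categorical built from int64 values becomes int64 again
            out[col] = out[col].astype(out[col].cat.categories.dtype)
    return out


# Small demonstration frame; the column names are made up for illustration.
demo = pd.DataFrame({
    "a": pd.Series([20, 60, 20], dtype="int64").astype("category"),
    "b": [1.0, 2.0, 3.0],
})
demo_plain = decategorize(demo, ["a"])
```

With the issue's variables, the call would look like `le.fit_transform(X=decategorize(X_train, cat_cols), y=y_train)`, followed by `le.transform(decategorize(X_test, cat_cols))`.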

Specifications

System:
python: 3.8.10  [GCC 7.5.0]
executable: ~/miniconda3/envs/ct38/bin/python
machine: Linux-4.11.0-14-generic-x86_64-with-glibc2.17

pip: 22.0.4
setuptools: 61.2.0
numpy: 1.21.6
scipy: 1.8.0
Cython: None
pandas: 1.4.0
category_encoders: 2.4.0

Hi @jyk4100

Thanks for this detailed bug report. I managed to find the issue and created a pull request for it that I'll merge.