scikit-learn-contrib/category_encoders

pd.NA should behave as np.nan

tvdboom opened this issue · 5 comments

Expected Behavior

pd.NA should behave the same as np.nan and be returned when handle_missing="return_nan".

Actual Behavior

pd.NA is treated like any other category.

Steps to Reproduce the Problem

import pandas as pd
from category_encoders.target_encoder import TargetEncoder

TargetEncoder(handle_missing="return_nan").fit_transform([["a"], ["b"], [pd.NA]], y=[0, 1, 1])

returns

          0
0  0.579928
1  0.710036
2  0.666667

instead of

          0
0  0.579928
1  0.710036
2       <NA>

You just need to add the argument handle_unknown="return_nan":

TargetEncoder(handle_missing="return_nan", handle_unknown="return_nan").fit_transform([["a"], ["b"], [pd.NA]], y=[0, 1, 1])

That's not the same. I want unknown values to return the target mean, as handle_unknown="value" does, and missing values to return missing. Also, your code returns np.nan, not pd.NA. It would be better if the returned NA type were the same as the input one.
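To make the NA-type point concrete: pandas recognises both sentinels as missing, but they are distinct objects, so a result holding np.nan does not match a pd.NA input. A minimal check, assuming only pandas and NumPy are installed:

```python
import numpy as np
import pandas as pd

# pandas reports both sentinels as missing...
both_missing = bool(pd.isna(pd.NA)) and bool(pd.isna(np.nan))

# ...but pd.NA and np.nan are different objects of different types.
same_object = pd.NA is np.nan
types = (type(pd.NA).__name__, type(np.nan).__name__)

print(both_missing, same_object, types)
```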

You can use NumPy:

Your data:

import numpy as np
import pandas as pd
from category_encoders.target_encoder import TargetEncoder

data = [["a"], ["b"], [pd.NA]]
y = [0, 1, 1]

Replace pd.NA with np.nan:

data = [[val if not pd.isna(val) else np.nan for val in row] for row in data]

Apply TargetEncoder:

encoder = TargetEncoder(handle_missing="return_nan")
encoded_data = encoder.fit_transform(data, y)

Convert the result back to pd.NA wherever np.nan is present:

encoded_data = pd.DataFrame([[pd.NA if pd.isna(val) else val for val in row] for row in encoded_data.values], columns=encoded_data.columns)

print(encoded_data)
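The same steps could be wrapped in a small helper that restores pd.NA only at the positions where the input was missing. This is only a sketch: fit_transform_preserving_na is a made-up name, not part of category_encoders, and it assumes the encoder returns a DataFrame with the same shape as its input.

```python
import numpy as np
import pandas as pd


def fit_transform_preserving_na(encoder, data, y):
    """Hypothetical helper: encode with np.nan, then restore pd.NA.

    Assumes encoder.fit_transform returns a DataFrame shaped like the input.
    """
    frame = pd.DataFrame(data)
    # Remember where the input was missing (pd.NA, None or np.nan).
    missing = frame.isna().to_numpy()

    # Swap every missing marker for np.nan before encoding.
    values = frame.to_numpy(dtype=object, copy=True)
    values[missing] = np.nan
    cleaned = pd.DataFrame(values, index=frame.index, columns=frame.columns)

    encoded = encoder.fit_transform(cleaned, y)

    # Put pd.NA back only at the originally missing positions.
    result = encoded.to_numpy(dtype=object, copy=True)
    result[missing] = pd.NA
    return pd.DataFrame(result, index=encoded.index, columns=encoded.columns)
```

With the encoder from this thread it would be called as fit_transform_preserving_na(TargetEncoder(handle_missing="return_nan"), data, y). The returned columns have object dtype, since that is what lets them hold pd.NA next to floats.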

I hope I was able to help you.

Thanks, but what I am looking for is a change in the library itself: a structural implementation, not an ad hoc solution.

Agreed! This should be changed. Do you want to create a PR?