pd.NA should behave as np.nan

Question

tvdboom opened this issue a year ago · 5 comments

tvdboom commented a year ago

Expected Behavior

pd.NA should behave the same as np.nan and be returned when handle_missing="return_nan".

Actual Behavior

pd.NA is treated like any other category.

Steps to Reproduce the Problem

import pandas as pd
from category_encoders.target_encoder import TargetEncoder

TargetEncoder(handle_missing="return_nan").fit_transform([["a"], ["b"], [pd.NA]], y=[0, 1, 1])

returns

          0
0  0.579928
1  0.710036
2  0.666667

instead of

          0
0  0.579928
1  0.710036
2       <NA>

Answer 1 · 2023-09-25T06:51:02.000Z

You just need to add the argument handle_unknown="return_nan":

TargetEncoder(handle_missing="return_nan", handle_unknown="return_nan").fit_transform([["a"], ["b"], [pd.NA]], y=[0, 1, 1])

Answer 2 · 2023-09-25T10:41:15.000Z

That's not the same. I want unknown values to return the target mean, as handle_unknown="value" does, and missing values to return missing. Also, your code returns np.nan, not pd.NA. It would be better if the returned NA type matched the input type.
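
To make the np.nan versus pd.NA distinction concrete, here is a minimal sketch using only pandas and NumPy. It says nothing about how category_encoders checks for missing values internally; it only shows that the two markers are different objects even though pandas treats both as missing.

```python
import numpy as np
import pandas as pd

# pandas treats both markers as missing...
both_missing = bool(pd.isna(np.nan)) and bool(pd.isna(pd.NA))

# ...but they are distinct objects of different types.
same_object = pd.NA is np.nan        # False
nan_type = type(np.nan).__name__     # "float"
na_type = type(pd.NA).__name__       # "NAType"

# A plain float column stores np.nan; a nullable "Float64" column stores pd.NA.
float_col = pd.Series([0.5, np.nan])
nullable_col = pd.Series([0.5, pd.NA], dtype="Float64")

print(float_col.dtype, nullable_col.dtype)
print(nullable_col[1] is pd.NA)
```

So an encoder that hands back a plain float column cannot contain pd.NA; the output would need an object or nullable dtype for the input NA type to survive.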

Answer 3 · 2023-09-25T11:50:24.000Z

You can use NumPy:

Your data:

import numpy as np
import pandas as pd
from category_encoders.target_encoder import TargetEncoder

data = [["a"], ["b"], [pd.NA]]
y = [0, 1, 1]

Replace pd.NA with np.nan:

data = [[val if not pd.isna(val) else np.nan for val in row] for row in data]

Apply TargetEncoder:

encoder = TargetEncoder(handle_missing="return_nan")
encoded_data = encoder.fit_transform(data, y)

Convert the result back to pd.NA wherever np.nan is present:

encoded_data = pd.DataFrame([[pd.NA if pd.isna(val) else val for val in row] for row in encoded_data.values], columns=encoded_data.columns)

print(encoded_data)
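
The steps above could also be wrapped in one helper. This is only a sketch of the workaround, not part of category_encoders: the name encode_preserving_na is hypothetical, and it assumes the encoder exposes fit_transform(X, y) and returns one output column per input column (true for the single-column example here, but not for every encoder).

```python
import numpy as np
import pandas as pd


def encode_preserving_na(encoder, X, y):
    """Hypothetical helper: fit-transform X, returning pd.NA where X was missing.

    Assumes encoder.fit_transform(X, y) returns output with the same shape as X.
    """
    X = pd.DataFrame(X)
    missing = X.isna().to_numpy()  # True where the input held pd.NA, np.nan or None

    # Replace every missing marker with np.nan before encoding.
    values = X.to_numpy(dtype=object, copy=True)
    values[missing] = np.nan
    X_clean = pd.DataFrame(values, index=X.index, columns=X.columns)

    encoded = pd.DataFrame(encoder.fit_transform(X_clean, y))

    # Use an object array so pd.NA can sit next to floats, then restore it.
    out = encoded.to_numpy(dtype=object, copy=True)
    out[missing] = pd.NA
    return pd.DataFrame(out, index=encoded.index, columns=encoded.columns)


class StubEncoder:
    """Stand-in used only to exercise the helper without category_encoders."""

    def fit_transform(self, X, y):
        # Mimics handle_missing="return_nan": NaN for missing cells, 0.5 otherwise.
        encoded = np.where(X.isna().to_numpy(), np.nan, 0.5)
        return pd.DataFrame(encoded, index=X.index, columns=X.columns)


result = encode_preserving_na(StubEncoder(), [["a"], ["b"], [pd.NA]], [0, 1, 1])
print(result)
```

With the real library, the call would be encode_preserving_na(TargetEncoder(handle_missing="return_nan"), data, y). Note that every missing input cell comes back as pd.NA, including cells that were originally np.nan or None.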

I hope this helps.

Answer 4 · 2023-09-25T12:33:28.000Z

Thanks, but what I am looking for is a change in the library itself: a structural implementation, not an ad hoc solution.

Answer 5 · 2023-09-25T14:54:17.000Z

Agreed! This should be changed. Do you want to create a PR?