scikit-learn-contrib/category_encoders

FutureWarning in ordinal encoder when downcasting objects

Expected Behavior

No FutureWarning is raised.

Actual Behavior

Currently, the following warning is emitted:

category_encoders/ordinal.py:198: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  X[column] = X[column].astype("object").fillna(np.nan).map(col_mapping)

Neither suppressing the warning, setting the pandas option, nor changing the dtypes on the caller side is sufficient for correctness.

Steps to Reproduce the Problem

  1. Create a data frame with object dtype.
  2. Fit a CountEncoder (or similar) on the data frame.
  3. Notice the warning.
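
The steps above can be narrowed down to the pandas call chain named in the warning. This is a minimal sketch, not the library's code path. It assumes pandas 2.2.x, where this deprecation is reported, and uses a made-up object column whose values are all numeric or missing, which is one case where pandas would downcast after `fillna`. The final `.map(col_mapping)` from `ordinal.py` is left out because the mapping is not needed to see the warning.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical column: object dtype, but every value is numeric or missing,
# so pandas is able to downcast the result of fillna.
col = pd.Series([1, 2, None], dtype="object")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Same chain as ordinal.py:198, without the trailing .map(col_mapping).
    filled = col.astype("object").fillna(np.nan)

# Keep only FutureWarnings; other warning categories are ignored here.
downcast_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]

# The resulting dtype and the number of warnings depend on the pandas version.
print(pd.__version__, filled.dtype, len(downcast_warnings))
for w in downcast_warnings:
    print(w.message)
```

Other pandas versions may print zero warnings here, since the deprecation is tied to specific releases.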

Specifications

  • Version: 2.6.3

For what it's worth, these local changes fixed things for me and kept the tests passing. If anyone is willing to turn this into an official fix, it would be much appreciated.

diff --git a/category_encoders/ordinal.py b/category_encoders/ordinal.py
index 45d333e..94804c0 100644
--- a/category_encoders/ordinal.py
+++ b/category_encoders/ordinal.py
@@ -195,7 +195,7 @@ class OrdinalEncoder(util.BaseEncoder, util.UnsupervisedTransformerMixin):
 
                 # Convert to object to accept np.nan (dtype string doesn't)
                 # fillna changes None and pd.NA to np.nan
-                X[column] = X[column].astype("object").fillna(np.nan).map(col_mapping)
+                X[column] = X[column].astype("object").infer_objects(copy=False).fillna(np.nan).map(col_mapping)
                 if util.is_category(X[column].dtype):
                     nan_identity = col_mapping.loc[col_mapping.index.isna()].array[0]
                     X[column] = X[column].cat.add_categories(nan_identity)

Thanks for reporting!

Your proposed fix seems fine, but I wonder whether something else might be better. The cast to object is just there (according to the comment) to accommodate np.nan as the fill, and we're about to map to numeric, so the dtype itself isn't critical information, and downcasting in particular isn't needed. Should we just opt in to the future behavior?
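
One way the opt-in could look, as a rough sketch rather than a proposed patch: scope the option to the single `fillna` call with `pd.option_context` instead of setting it globally. The helper names `_no_silent_downcasting` and `fill_missing_as_object` are hypothetical, as is the sample column. The guard for pandas versions that lack the option is an assumption about what the library would need to support.

```python
import contextlib
import warnings

import numpy as np
import pandas as pd


def _no_silent_downcasting():
    """Hypothetical helper: context manager that opts in to the future behavior."""
    try:
        pd.get_option("future.no_silent_downcasting")
    except KeyError:
        # pandas raises OptionError (a KeyError subclass) for unknown options;
        # on such versions there is nothing to opt in to.
        return contextlib.nullcontext()
    return pd.option_context("future.no_silent_downcasting", True)


def fill_missing_as_object(column: pd.Series) -> pd.Series:
    """Hypothetical helper: turn None and pd.NA into np.nan without downcasting."""
    with _no_silent_downcasting():
        return column.astype("object").fillna(np.nan)


# Made-up column that would otherwise be downcast after fillna.
col = pd.Series([1, 2, None], dtype="object")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    filled = fill_missing_as_object(col)

# Collect only the downcasting deprecation message, if any was emitted.
downcast_messages = [str(w.message) for w in caught if "Downcasting" in str(w.message)]
print(filled.dtype, downcast_messages)
```

Because the option is only active inside the `with` block, callers who have not opted in themselves would keep their own pandas settings.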