scikit-learn-contrib/category_encoders

Handle missing in one hot encoder

PaulWestenthanner opened this issue · 3 comments

Expected Behavior

Currently, handle_missing="value" adds a new column, although the documentation says 'value' will encode a new value as 0 in every dummy column.
Furthermore, we need a test for this.

Actual Behavior

The encoder adds an extra column for the missing value instead of encoding it as 0 in every dummy column.
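
To make the two encodings concrete, here is a minimal sketch that uses only pandas. It does not call category_encoders at all; pd.get_dummies is swapped in purely to illustrate the difference between an all-zero row and an extra column for the missing value, and the variable names are made up for this example.

```python
import pandas as pd

data = pd.DataFrame([("foo", 1), ("bar", 2), (None, 6)], columns=["c1", "c2"])

# dummy_na=False (the pandas default): the missing row gets 0 in every
# dummy column, which is the behavior the documentation describes.
all_zero = pd.get_dummies(data["c1"], dummy_na=False, dtype=int)

# dummy_na=True: the missing value gets its own indicator column,
# which is analogous to the behavior reported in this issue.
extra_column = pd.get_dummies(data["c1"], dummy_na=True, dtype=int)

print(all_zero)
print(extra_column)
```

The first frame has two dummy columns and the second has three; only the row for the missing value differs between them.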

Steps to Reproduce the Problem

from category_encoders import OneHotEncoder
import pandas as pd

he = OneHotEncoder(handle_missing="value")

data = [("foo", 1), ("bar", 2), (None, 6)]
data = pd.DataFrame(data, columns=["c1", "c2"])
print(he.fit_transform(data))
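
Since the issue also asks for a test, the check could look roughly like the sketch below. The helper missing_rows_all_zero is hypothetical and not part of category_encoders, and the two frames are hand-written stand-ins for encoder output: the "c1_1"-style column names are an assumption about how the dummy columns are named. In a real test, the frame returned by fit_transform in the snippet above would be passed in instead.

```python
import pandas as pd


def missing_rows_all_zero(encoded: pd.DataFrame, source: pd.Series, prefix: str) -> bool:
    """Return True if every row where `source` is missing has 0 in all
    dummy columns of `encoded` whose name starts with `prefix`."""
    dummy_cols = [c for c in encoded.columns if str(c).startswith(prefix)]
    if not dummy_cols:
        return False  # nothing was encoded, so the check cannot pass
    missing = source.isna().to_numpy()
    return bool((encoded.loc[missing, dummy_cols] == 0).all().all())


source = pd.Series(["foo", "bar", None], name="c1")

# Hand-written stand-in for the documented behavior (column names assumed).
documented = pd.DataFrame({"c1_1": [1, 0, 0], "c1_2": [0, 1, 0], "c2": [1, 2, 6]})

# Hand-written stand-in for the reported behavior: an extra column for the missing value.
reported = pd.DataFrame(
    {"c1_1": [1, 0, 0], "c1_2": [0, 1, 0], "c1_3": [0, 0, 1], "c2": [1, 2, 6]}
)

print(missing_rows_all_zero(documented, source, "c1_"))
print(missing_rows_all_zero(reported, source, "c1_"))
```

Matching on a column prefix keeps the check independent of how many categories the encoder saw during fit.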

Specifications

  • Version: 2.6
  • Platform: linux

Would this replace the new "ignore" from #396?

I would expect this to be the correct behavior; is the added column a longstanding behavior, or perhaps a regression that wasn't caught in testing?

Oh, you're right. I missed this when adding the ignore option. Thanks for pointing it out.
I'm not sure about the naming, though. In most encoders we have the option value to put in "some value that makes sense", so that name makes sense for people familiar with the library. ignore, on the other hand, is more telling.