BCG-X-Official/sklearndf

OneHotEncoderDF: OneHot encoder wrapper fails for columns reduction options

Closed this issue · 0 comments

Describe the bug
The OneHotEncoderDF wrapper fails when a column-dropping option is set (drop="if_binary" or drop="first").

The wrapper computes the expected number of columns in the transformed dataset without taking the drop option into account.
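
To make the mismatch concrete, here is a minimal sketch of how the expected output columns change with the drop option. It is not sklearndf's actual implementation: the function name `expected_onehot_columns`, the example categories, and the `<feature>_<category>` naming are assumptions for illustration. The dropping rules follow scikit-learn's documented behaviour ("first" drops the first category of every feature, "if_binary" drops it only for features with exactly two categories).

```python
from typing import Dict, List, Optional


def expected_onehot_columns(
    categories: Dict[str, List[str]], drop: Optional[str] = None
) -> List[str]:
    """Return the one-hot output column names expected for the given drop option.

    Hypothetical helper for illustration only; not part of sklearndf.
    """
    if drop not in (None, "first", "if_binary"):
        raise ValueError(f"unsupported drop option: {drop!r}")

    names: List[str] = []
    for feature, values in categories.items():
        if drop == "first":
            # the first category of every feature is dropped
            kept = values[1:]
        elif drop == "if_binary" and len(values) == 2:
            # only binary features lose their first category
            kept = values[1:]
        else:
            # no reduction for this feature
            kept = values
        names.extend(f"{feature}_{value}" for value in kept)
    return names


# Made-up categories: one binary feature and one three-valued feature
example = {"gender": ["F", "M"], "plan": ["basic", "plus", "pro"]}
no_drop = expected_onehot_columns(example)
drop_first = expected_onehot_columns(example, drop="first")
drop_if_binary = expected_onehot_columns(example, drop="if_binary")
```

With these example categories, the unreduced encoding has five columns, drop="first" leaves three, and drop="if_binary" leaves four. A wrapper that always assumes the unreduced count would disagree with the encoder's actual output in the last two cases.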

To Reproduce
Steps to reproduce the behavior:

  1. Open a notebook
  2. Run the following code:
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation import (
    ColumnTransformerDF,
    OneHotEncoderDF,
    SimpleImputerDF,
)
X_churn: pd.DataFrame = ...
y_churn: pd.Series = ...
# Screenshot 2021-02-16 at 16 09 49: https://user-images.githubusercontent.com/32160831/108081572-6733de80-7071-11eb-8bca-f52932a4173e.png

# For categorical features we will use the mode as the imputation value and also one-hot encode
preprocessing_categorical = PipelineDF(
    steps=[
        ("imputer", SimpleImputerDF(strategy="most_frequent", fill_value="<na>")),
        ("one-hot", OneHotEncoderDF(sparse=False, drop="if_binary")),
    ]
)

# For numeric features we will impute using the median
preprocessing_numerical = SimpleImputerDF(strategy="median")

# Put the pipeline together
preprocessing_features = ColumnTransformerDF(
    transformers=[
        (
            "categorical",
            preprocessing_categorical,
            make_column_selector(dtype_include=object),
        ),
        (
            "numerical",
            preprocessing_numerical,
            make_column_selector(dtype_include=np.number),
        ),
    ]
)

# Run the preprocessing
transformed_features = preprocessing_features.fit_transform(X=X_churn, y=y_churn)
transformed_features.head()
  3. See error

[Screenshot of the error: 2021-02-16 at 16 16 22]

Expected behavior
Expected the transformed dataset to contain only one column for each categorical column that has exactly 2 unique values.

  • Version: sklearndf==1.0.1