skrub-data/skrub

TableVectorizer's "numerical_transformer" does not accept Pipelines

Closed this issue · 3 comments

### Describe the bug

The documentation of `TableVectorizer` describes the `numerical_transformer` parameter as follows:

> Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, or ‘passthrough’ to return the unencoded columns (default).

So I would assume that I can pass a Pipeline.

### Steps/Code to Reproduce

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from skrub import TableVectorizer

# Get the data as a DataFrame (X) and a Series (y)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Numerical transformer. There are no NaNs in the data, but it could be any pipeline
num_prep = make_pipeline(SimpleImputer(add_indicator=True), StandardScaler())

# TableVectorizer with a Pipeline as the numerical transformer
encoder = TableVectorizer(numerical_transformer=num_prep)

# Model
clf = make_pipeline(encoder, LogisticRegression())
clf.fit(X, y)  # raises the ValueError shown below
```

### Expected Results

The pipeline should fit the data without raising an error.

### Actual Results

ValueError: 'transformer' must be an instance of sklearn.base.TransformerMixin, 'remainder' or 'passthrough'. Got transformer=Pipeline(steps=[('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler())]).
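The wording of the message suggests a type check against `TransformerMixin`. That is an assumption drawn only from the error text, not from reading the skrub source. If so, the following minimal sketch, which uses scikit-learn alone, shows why a `Pipeline` would be rejected even though it behaves like a transformer: `Pipeline` does not inherit from `TransformerMixin`.

```python
from sklearn.base import TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler())

# A plain transformer inherits from TransformerMixin ...
scaler_is_mixin = isinstance(StandardScaler(), TransformerMixin)

# ... but Pipeline does not, so an isinstance-based check rejects it.
pipe_is_mixin = isinstance(pipe, TransformerMixin)

# The pipeline still exposes the transformer API through its last step.
pipe_can_transform = hasattr(pipe, "fit_transform") and hasattr(pipe, "transform")

print(scaler_is_mixin, pipe_is_mixin, pipe_can_transform)
```

A check based on the presence of `fit` and `transform` methods, rather than on inheritance, would accept both kinds of object.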

### Versions

```shell
System:
    python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ]
executable: /opt/homebrew/Caskroom/miniforge/base/envs/test_skrub/bin/python
   machine: macOS-14.3-arm64-arm-64bit

Python dependencies:
      sklearn: 1.4.0
          pip: 23.3.2
   setuptools: 69.0.3
        numpy: 1.26.3
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.0
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /opt/homebrew/Caskroom/miniforge/base/envs/test_skrub/lib/libopenblas.0.dylib
        version: 0.3.26
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /opt/homebrew/Caskroom/miniforge/base/envs/test_skrub/lib/libomp.dylib
        version: None
0.1.0
```

Thanks a lot for reporting this! We'll make sure to address it in #877.

Here is a reproducer, to be added to our test suite:

```python
import pandas as pd
from skrub import TableVectorizer
from sklearn.pipeline import make_pipeline

df = pd.DataFrame(dict(a=[1.1, 2.2]))
tv = TableVectorizer(numerical_transformer=make_pipeline('passthrough'))
tv.fit(df)
```

Fixed by #902.