skrub-data/skrub

`SingleColumnTransformer`s don't work with `ColumnTransformer`

Closed this issue · 4 comments

Describe the bug

On the dev version, both GapEncoder and MinHashEncoder work well with the TableVectorizer, but fail with the ColumnTransformer from scikit-learn.

Steps/Code to Reproduce

from sklearn.compose import make_column_transformer
from skrub.datasets import fetch_employee_salaries
from skrub import GapEncoder

bunch = fetch_employee_salaries()

transformer = make_column_transformer(
    (GapEncoder(), bunch.X.select_dtypes("object").columns)
)
transformer.fit(bunch.X)

Expected Results

No error is thrown

Actual Results

ValueError: GapEncoder.fit should be passed a single column, not a dataframe. GapEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer.

Versions

System:
    python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ]
executable: /Users/vincentmaladiere/miniforge3/envs/sandbox/bin/python3.12
   machine: macOS-14.0-arm64-arm-64bit

Python dependencies:
      sklearn: 1.5.0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.9.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/vincentmaladiere/miniforge3/envs/sandbox/lib/python3.12/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/vincentmaladiere/miniforge3/envs/sandbox/lib/python3.12/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.27
threading_layer: pthreads
   architecture: neoversen1

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/vincentmaladiere/miniforge3/envs/sandbox/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
skrub: 0.2.0.dev0

That's expected if you pass them a dataframe (i.e., several columns in the ColumnTransformer). They will work if you pass a single column, e.g. (GapEncoder(), "column_name") -- just like scikit-learn's TfidfVectorizer, for example.
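The distinction can be shown with scikit-learn alone, since TfidfVectorizer has the same single-column behavior. A small sketch (the dataframe and column names are made up for illustration):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "text": ["red apple", "green pear", "red pear"],
    "n": [1, 2, 3],
})

# A plain string selects a single column: the transformer receives a
# pandas Series, which is what TfidfVectorizer (and skrub's
# single-column transformers) expect. A list -- even a one-element
# list -- would select a dataframe instead.
ct = make_column_transformer((TfidfVectorizer(), "text"))
out = ct.fit_transform(df)
print(out.shape)  # one output row per input row, one column per token
```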

That was a deliberate decision we made with @GaelVaroquaux. Our reasoning is that, with the TableVectorizer and the upcoming pipeline builder, skrub users should no longer need the ColumnTransformer.

the equivalent pipeline would look something like

from skrub.datasets import fetch_employee_salaries
from skrub import GapEncoder
from skrub import PipeBuilder
from skrub import selectors as s

bunch = fetch_employee_salaries()

transformer = PipeBuilder().apply(GapEncoder(), cols=s.string()).get_pipeline()
transformer.fit_transform(bunch.X)

Or, for this dataset, you may want the TableVectorizer instead: I think the GapEncoder with default parameters might not work on the "gender" column, so the cardinality threshold could be useful. With the skrub pipeline you could also do something like cols=s.string() - 'gender'.

I can expand the error message to say that, when the transformer is used in a ColumnTransformer, the corresponding column selection should be a single column name rather than a list.

Yes, right now the error message is plainly wrong, since you can't use the GapEncoder with multiple columns in the ColumnTransformer.