skrub-data/skrub

Shorthand for getting only the preprocessing part of the TableVectorizer


Problem Description

Sometimes we may want to apply only the preprocessing/cleaning steps of the TableVectorizer (parsing datetimes, handling pandas extension dtypes, etc.) and handle the actual encoding in separate pipeline steps.
This will probably become more relevant once the Recipe (or whatever it ends up being called) is introduced: we can use it to build exactly the pipeline we want, but we would still like to apply the default cleaning done by the TableVectorizer.

If this sounds like a plausible use case, maybe we could have a shorthand for:

TableVectorizer(
    high_cardinality_transformer="passthrough",
    low_cardinality_transformer="passthrough",
    datetime_transformer="passthrough",
    numeric_transformer="passthrough",
    specific_transformers=(),
)

perhaps something like:

TableSkrubber()
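
Until a shorthand like this exists, a small helper could play the same role. The sketch below is only an illustration: `PASSTHROUGH_KWARGS` and `make_skrubber` are hypothetical names, not part of skrub, and the parameter names are the ones used in the snippet above, which may differ between skrub releases.

```python
# Hypothetical helper, not part of skrub: gives the "cleaning only"
# configuration from the snippet above a single name.
PASSTHROUGH_KWARGS = {
    "high_cardinality_transformer": "passthrough",
    "low_cardinality_transformer": "passthrough",
    "datetime_transformer": "passthrough",
    "numeric_transformer": "passthrough",
    "specific_transformers": (),
}


def make_skrubber(**overrides):
    """Return a TableVectorizer that cleans columns but does not encode them.

    Keyword arguments in ``overrides`` replace the passthrough defaults, so a
    single encoder can be switched back on if needed.
    """
    # Imported lazily so the configuration can be inspected without skrub.
    from skrub import TableVectorizer

    # Assumption: the installed skrub accepts the parameter names used in
    # this issue.
    return TableVectorizer(**{**PASSTHROUGH_KWARGS, **overrides})
```

With this in place, `make_skrubber()` could serve as the first step of a scikit-learn pipeline, with the encoders of our choice added as later steps.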

Some examples of the kind of cleaning the TableVectorizer does:

>>> import pandas as pd
>>> from skrub import TableVectorizer


>>> skrubber = TableVectorizer(
...     high_cardinality_transformer="passthrough",
...     low_cardinality_transformer="passthrough",
...     datetime_transformer="passthrough",
...     numeric_transformer="passthrough",
...     specific_transformers=(),
... )

>>> df = pd.DataFrame({
...     'a': ['2020-01-02', '2020-01-03'],
...     'b': ['2.2', 'nan'],
...     'c': [1.5, pd.NA],
...     'd': [True, False],
...     'e': pd.Series([4.5, 'a'], dtype='category'),
... })
>>> df
            a    b     c      d    e
0  2020-01-02  2.2   1.5   True  4.5
1  2020-01-03  nan  <NA>  False    a
>>> df.dtypes
a      object
b      object
c      object
d        bool
e    category
dtype: object
>>> df['e'].cat.categories
Index([4.5, 'a'], dtype='object')

>>> skrubbed = skrubber.fit_transform(df)
>>> skrubbed
           a    b    c    d    e
0 2020-01-02  2.2  1.5  1.0  4.5
1 2020-01-03  NaN  NaN  0.0    a
>>> skrubbed.dtypes
a    datetime64[ns]
b           float32
c           float32
d           float32
e          category
dtype: object
>>> skrubbed['e'].cat.categories
Index(['4.5', 'a'], dtype='object')

I like the name "Skrubber"