skrub-data/skrub

TableVectorizer imputing logic is confusing

Closed this issue · 5 comments

While working on #819, I found that the imputing behavior we currently use in TableVectorizer.auto_cast for categorical columns is confusing.

1. Two-way imputing

As stated in this comment, we currently:

  • Replace the "almost missing" strings with np.nan for all column dtypes
  • Then, for non-numeric dtypes (string, categorical, and object), do the opposite: replace np.nan with the string "missing" (see the sketch after this list)
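
A minimal sketch of what this two-step logic amounts to on a toy DataFrame (the "N/A" marker and the column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(cat=["a", "N/A", None], num=[1.0, np.nan, 3.0]))

# Step 1: map the "almost missing" strings (here "N/A") to np.nan in every column.
df = df.replace("N/A", np.nan)

# Step 2: for the non-numeric column only, map np.nan back to the string "missing".
df["cat"] = df["cat"].fillna("missing")

df
#        cat  num
# 0        a  1.0
# 1  missing  NaN
# 2  missing  3.0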

Why do we need to replace np.nan with "missing"?

2. Replacing non-numeric values with np.nan when trained on numeric data

When trained on a numerical column, TableVectorizer replaces the string/object values of that column with np.nan during transform:

import numpy as np
import pandas as pd
from skrub import TableVectorizer

df_train = pd.DataFrame(dict(a=[np.nan, 1]))
df_test = pd.DataFrame(dict(a=["a", "b"]))

tv = TableVectorizer().fit(df_train)
tv.transform(df_test)
# /Users/vincentmaladiere/INRIA/skrub/skrub/_table_vectorizer.py:881: UserWarning:
# Value 'a' could not be converted to inferred type float64 in column 'a'.
# Such values will be replaced by NaN.
#   X = self._apply_cast(X)
# array([[nan],
#        [nan]])

This is dangerous because it might silence critical errors for the user.
We either need to keep the data and let downstream estimators raise an error, or raise it ourselves.
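
A hypothetical sketch of the "raise it ourselves" option (the cast_or_raise helper is made up, not skrub API): after coercing, any entry that became NaN but was not missing in the input signals a failed conversion.

import pandas as pd

def cast_or_raise(col, dtype):
    # Coerce to the inferred numeric dtype, turning failures into NaN.
    cast = pd.to_numeric(col, errors="coerce").astype(dtype)
    # Entries that are NaN after the cast but were not missing before failed to convert.
    failed = cast.isna() & ~col.isna()
    if failed.any():
        raise ValueError(
            f"Values {col[failed].tolist()} in column {col.name!r} "
            f"could not be converted to {dtype}."
        )
    return cast

cast_or_raise(pd.DataFrame(dict(a=["a", "b"]))["a"], "float64")
# ValueError: Values ['a', 'b'] in column 'a' could not be converted to float64.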

For the GapEncoder, missing values should probably all be encoded as zero vectors.
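
For illustration only (every name here is made up), that convention could look like:

import numpy as np
import pandas as pd

col = pd.Series(["a", None, "b", None])
encoded = np.random.rand(len(col), 3)   # stand-in for a GapEncoder output
encoded[col.isna().to_numpy()] = 0.0    # missing inputs become all-zero vectors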

Does this imply letting the transformers deal with the missing categorical values themselves?

Right, but erroring at predict time can mean crashing the production (or prediction :D) server.

I like this thought a lot because it paves the way for a "local or staging vs production" config file, where production would be more permissive to avoid crashing. This could be enabled across all of skrub.
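
A rough sketch of what such a switch could look like (purely hypothetical; neither the flag nor the helper exists in skrub):

import warnings

SKRUB_ENV = "production"  # hypothetical global flag: "staging" or "production"

def handle_cast_failure(message):
    if SKRUB_ENV == "production":
        warnings.warn(message)     # permissive: warn and keep the pipeline alive
    else:
        raise ValueError(message)  # strict: fail fast during development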

We cannot keep the data: the dtype does not allow it.

What do you mean? Since we trigger a warning, we could raise an error, couldn't we?

After an IRL meeting, we decided to:

  • Remove the imputing logic entirely
  • Leave the "Replacing non-numeric values with np.nan when trained on numeric data" issue as-is, and see in the medium term what the most robust configuration for a scikit-learn pipeline is. Our objective is to have an option where the pipeline never crashes.