skrub-data/skrub

TableVectorizer imputing logic is confusing

Closed this issue · 5 comments

While working on #819, I found that the imputing behavior we currently use in TableVectorizer.auto_cast for categorical columns is confusing.

1. Two-way imputing

As stated in this comment, we currently:

  • Replace the "almost missing" strings with np.nan for all column dtypes
  • Then, for non-numeric dtypes (string, categorical, and object), do the opposite: replace np.nan with the string "missing" (see the sketch after this list)
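
A minimal sketch of what this two-step logic amounts to on a toy DataFrame (the "N/A" marker and the column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(cat=["a", "N/A", None], num=[1.0, np.nan, 3.0]))

# Step 1: map the "almost missing" strings (here "N/A") to np.nan in every column.
df = df.replace("N/A", np.nan)

# Step 2: for the non-numeric column only, map np.nan back to the string "missing".
df["cat"] = df["cat"].fillna("missing")

df
#        cat  num
# 0        a  1.0
# 1  missing  NaN
# 2  missing  3.0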

Why do we need to replace np.nan with "missing"?

2. Replacing non-numeric values with np.nan when trained on numeric data

When trained on a numerical column, TableVectorizer replaces the string/object values of that column with np.nan during transform:

import numpy as np
import pandas as pd
from skrub import TableVectorizer

df_train = pd.DataFrame(dict(a=[np.nan, 1]))
df_test = pd.DataFrame(dict(a=["a", "b"]))

tv = TableVectorizer().fit(df_train)
tv.transform(df_test)
# /Users/vincentmaladiere/INRIA/skrub/skrub/_table_vectorizer.py:881: UserWarning:
# Value 'a' could not be converted to inferred type float64 in column 'a'.
# Such values will be replaced by NaN.
#   X = self._apply_cast(X)
# array([[nan],
#        [nan]])

This is dangerous because it might silence critical errors for the user.
We either need to keep the data and let downstream estimators raise an error, or raise it ourselves.
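
A hypothetical sketch of the "raise it ourselves" option (the cast_or_raise helper is made up, not skrub API): after coercing, any entry that became NaN but was not missing in the input signals a failed conversion.

import pandas as pd

def cast_or_raise(col, dtype):
    # Coerce to the inferred numeric dtype, turning failures into NaN.
    cast = pd.to_numeric(col, errors="coerce").astype(dtype)
    # Entries that are NaN after the cast but were not missing before failed to convert.
    failed = cast.isna() & ~col.isna()
    if failed.any():
        raise ValueError(
            f"Values {col[failed].tolist()} in column {col.name!r} "
            f"could not be converted to {dtype}."
        )
    return cast

cast_or_raise(pd.DataFrame(dict(a=["a", "b"]))["a"], "float64")
# ValueError: Values ['a', 'b'] in column 'a' could not be converted to float64.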

For the GapEncoder, missing values should probably all be encoded as zero vectors.
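
For illustration only (every name here is made up), that convention could look like:

import numpy as np
import pandas as pd

col = pd.Series(["a", None, "b", None])
encoded = np.random.rand(len(col), 3)   # stand-in for a GapEncoder output
encoded[col.isna().to_numpy()] = 0.0    # missing inputs become all-zero vectors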

Does this imply letting the transformers deal with the missing categorical values themselves?

Right, but erroring at predict time can mean crashing the production (or prediction :D) server.

I like this thought a lot because it paves the way for a "local or staging vs production" config file, where production would be more permissive to avoid crashing. This could be enabled across all of skrub.
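
A rough sketch of what such a switch could look like (purely hypothetical; neither the flag nor the helper exists in skrub):

import warnings

SKRUB_ENV = "production"  # hypothetical global flag: "staging" or "production"

def handle_cast_failure(message):
    if SKRUB_ENV == "production":
        warnings.warn(message)     # permissive: warn and keep the pipeline alive
    else:
        raise ValueError(message)  # strict: fail fast during development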

We cannot keep the data: the dtype does not allow it.

What do you mean? Since we trigger a warning, we could raise an error, couldn't we?

After an IRL meeting, we decided to:

  • Remove the imputing logic entirely
  • Leave the "Replacing non-numeric values with np.nan when trained on numeric data" issue as-is, and see in the medium term what the most robust configuration for a scikit-learn pipeline is. Our objective is to have an option where the pipeline never crashes.