skrub-data/skrub

GapEncoder and MinHashEncoder modify their input in place


Nulls are replaced by empty strings in the original array:

>>> import pandas as pd
>>> from skrub import GapEncoder

>>> encoder = GapEncoder(n_components=2)
>>> df = pd.DataFrame(dict(a=["one", None, "one"]))
>>> df
      a
0   one
1  None
2   one
>>> encoder.fit(df)
GapEncoder(n_components=2)
>>> df
     a
0  one
1
2  one

As the inputs are strings that need to be processed anyway, it is easy to avoid this in-place modification without making an extra copy of the input.

For example, we can change the preprocessor of the CountVectorizer or HashingVectorizer:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> x = ['abcd', 'abcd', 'cdef', None]
>>> v = CountVectorizer(analyzer='char')
>>> v.fit_transform(x).A
Traceback (most recent call last):
    ...
AttributeError: 'NoneType' object has no attribute 'lower'

>>> v = CountVectorizer(analyzer='char', preprocessor=lambda x: x if isinstance(x, str) else '')
>>> v.fit_transform(x).A
array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0]])
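As a quick check (a minimal sketch of the same workaround), the custom preprocessor turns the missing value into an all-zero row while leaving the original input list untouched:

```python
from sklearn.feature_extraction.text import CountVectorizer

x = ['abcd', 'abcd', 'cdef', None]
v = CountVectorizer(analyzer='char',
                    preprocessor=lambda s: s if isinstance(s, str) else '')
out = v.fit_transform(x)

# the input list still holds None, not an empty string
assert x[3] is None
# vocabulary is the six characters a, b, c, d, e, f
assert out.shape == (4, 6)
```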

Maybe it would make sense to ask scikit-learn to add an option letting the default preprocessor do this imputation. In the meantime, I would suggest we do it in skrub to avoid modifying X in place.

Note that the MinHashEncoder has the same issue:

>>> import pandas as pd
>>> import numpy as np
>>> from skrub import MinHashEncoder

>>> encoder = MinHashEncoder(n_components=2)
>>> df = pd.DataFrame(dict(a=["one", None, "one"]))
>>> df
      a
0   one
1  None
2   one
>>> df.isna()
       a
0  False
1   True
2  False
>>> _ = encoder.fit_transform(df)
>>> df
     a
0  one
1  NAN
2  one
>>> df.isna()
       a
0  False
1  False
2  False
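Until the in-place modification is fixed in skrub, a caller-side workaround (sketched here with the encoder call commented out, since only the mutation of the input is at issue) is to hand the encoder a copy of the frame:

```python
import pandas as pd

df = pd.DataFrame(dict(a=["one", None, "one"]))

# pass a copy so any in-place mutation hits the copy, not df
safe_input = df.copy()
# encoder.fit_transform(safe_input)  # df keeps its missing value

assert safe_input is not df
assert df["a"].isna().sum() == 1
```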

Here is a way to keep the default preprocessing while adding the NaN handling; CountVectorizer exposes a public build_preprocessor() method:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> x = ['abcd', 'abcd', 'cdef', None]

>>> default_preprocessor = CountVectorizer().build_preprocessor()
>>> def preprocessor(text):
...     return default_preprocessor(text) if isinstance(text, str) else ''

>>> v = CountVectorizer(analyzer='char', preprocessor=preprocessor)
>>> v.fit_transform(x).A
array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0]])
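Wrapping the default preprocessor this way preserves its usual behavior (lowercasing, with the default lowercase=True configuration) while mapping missing values to the empty string; a small sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

default_preprocessor = CountVectorizer().build_preprocessor()

def preprocessor(text):
    # keep the default normalization for strings, impute '' otherwise
    return default_preprocessor(text) if isinstance(text, str) else ''

assert preprocessor('ABCD') == 'abcd'  # default lowercasing preserved
assert preprocessor(None) == ''        # missing values imputed
```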