skrub-data/skrub

Basic regression problem raises exception on inference

Closed · 4 comments

Describe the bug

Fitting a model on a basic regression problem (the Boston house prices dataset) succeeds, but running inference with that model on new data raises an exception.

Steps/Code to Reproduce

import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from skrub import TableVectorizer

X, y = sklearn.datasets.fetch_openml("boston", version=1, return_X_y=True, as_frame=True, parser="auto")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
p = make_pipeline(TableVectorizer(), SVR())
p.fit(X_train, y_train)
p.score(X_test, y_test)

Expected Results

No exception.

Actual Results

    105 def _union_category(X_col, dtype):
    106     """Update a categorical dtype with new entries."""
--> 107     known_categories = dtype.categories
    108     new_categories = pd.unique(X_col.loc[X_col.notnull()])
    109     dtype = pd.CategoricalDtype(categories=known_categories.union(new_categories))

AttributeError: 'numpy.dtypes.Int64DType' object has no attribute 'categories'

Versions

0.1.0

Swapping these two lines in _table_vectorizer.py seems to solve the problem, though I'm not familiar enough with Skrub to validate that this is the correct solution:

X = _to_numeric(X)
self.inferred_column_types_ = X.dtypes.to_dict()
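To illustrate why the order of these two statements could matter, here is a small pandas-only sketch. It does not call skrub: `pd.to_numeric(... .astype(str))` is only a stand-in for the private `_to_numeric` helper, whose exact behaviour is an assumption here, and the variable names are made up for the example.

```python
import pandas as pd

# A categorical column whose categories are digit strings.
X = pd.DataFrame({"a": pd.Series(["0", "1"], dtype="category")})

# Dtypes recorded before any conversion still describe the raw input.
dtypes_before = X.dtypes.to_dict()

# Stand-in for the numeric conversion step (assumption, not skrub's code).
X_converted = X.assign(a=pd.to_numeric(X["a"].astype(str)))

# Dtypes recorded after conversion describe the converted column instead.
dtypes_after = X_converted.dtypes.to_dict()

print(isinstance(dtypes_before["a"], pd.CategoricalDtype))
print(hasattr(dtypes_after["a"], "categories"))
```

The dtype recorded after conversion is a plain NumPy integer dtype with no `categories` attribute, which is consistent with the AttributeError above if that recorded dtype is later passed to `_union_category` for a column that arrives as categorical.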

Thank you for reporting this problem! I can reproduce it.
The TableVectorizer is undergoing some refactoring and improvement in #848, and we'll make sure to add a test to ensure this problem is fixed.
I don't think we want to just swap those 2 lines, because we do want to apply the conversion to numerical dtypes for the appropriate columns. Ensuring that we apply the correct and consistent transformations in fit and transform is a goal of #848.
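As a rough sketch of what "consistent transformations in fit and transform" can mean, the toy class below decides once, at fit time, which columns to cast to numbers and then replays that decision at transform time. `ConsistentCaster` and its attributes are hypothetical names for illustration only; this is not the design in #848 and not skrub's implementation.

```python
import pandas as pd


class ConsistentCaster:
    """Hypothetical toy transformer, not part of skrub."""

    def fit(self, X):
        # Decide per column, once, whether it can be parsed as numbers.
        self.to_numeric_ = {}
        for name in X.columns:
            try:
                pd.to_numeric(X[name].astype(str))
                self.to_numeric_[name] = True
            except (ValueError, TypeError):
                self.to_numeric_[name] = False
        return self

    def transform(self, X):
        # Replay the fit-time decision instead of re-inferring from new data.
        X = X.copy()
        for name, numeric in self.to_numeric_.items():
            if numeric:
                X[name] = pd.to_numeric(X[name].astype(str), errors="coerce")
        return X


train = pd.DataFrame({
    "a": pd.Series(["0", "1"], dtype="category"),
    "b": ["x", "y"],
})
test = pd.DataFrame({
    "a": pd.Series(["1", "2"], dtype="category"),
    "b": ["x", "y"],
})
out = ConsistentCaster().fit(train).transform(test)
```

Because the per-column decision is stored rather than the dtype observed at one particular moment, the same conversion is applied whether the column arrives as categorical, string, or already numeric.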

Hi @lsorber,
We're prioritizing fixing this problem. It's due to the inference of which transformation gets applied to which column, and in particular the fact that the inference must cater for the column types seen in both the train and the test data.
Solving this in a clean way (not one that moves the problem from one end to another) requires a refactoring, so I suspect that it will take a couple of weeks, plus the time to merge and make a release. Expect about a month before we can release a fix.

Here is a minimal reproducer, to add as a test:

import pandas as pd
from skrub import TableVectorizer

df = pd.DataFrame(dict(a=pd.Series(['0', '1'], dtype='category')))
TableVectorizer().fit(df).transform(df)