skrub-data/skrub

Example 08_join_aggregation broken

Vincent-Maladiere opened this issue · 4 comments

Describe the issue linked to the documentation

On the stable and dev version of the doc, in the "Hyper-parameters tuning and cross validation" section of the 08_join_aggregation example, the GridSearchCV outputs scores are all nan.

This is due to the estimator failing during the CV.

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- index__skrub_f3c63e8c__
Feature names seen at fit time, yet now missing:
- index__skrub_634644a2__

It appears that a column name change during one of the preprocessing steps of the pipeline between train and predict breaks the _check_feature_names .

Suggest a potential alternative/fix

I need more time to understand which preprocessor led to this column name change.

Here is a minimal reproducer:

>>> import pandas as pd

>>> from skrub import AggTarget

>>> df = pd.DataFrame(dict(a=[1, 1, 2, 2], b=[10, 20, 30, 40]))
>>> y = pd.Series([1, 2, 3, 4], name="b")

>>> transformer_1 = AggTarget(main_key='a', operation='mean')
>>> transformer_2 = AggTarget(main_key='a', operation='mean')
>>> out = transformer_1.fit_transform(df, y)
>>> out = transformer_2.fit_transform(out, y)
>>> out
   a   b  b_mean_target  b_mean_target__skrub_38870b21__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5

Note the name of the last column changes

>>> transformer_2.transform(transformer_1.transform(df))
   a   b  b_mean_target  b_mean_target__skrub_6f32724b__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5


>>> transformer_2.transform(transformer_1.transform(df))
   a   b  b_mean_target  b_mean_target__skrub_585a8372__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5

This is due to AggTarget calling left_join in transform with duplicate column names in the dataframe it joins:

return _join_utils.left_join(

left_join is a stateless low-level function and it appends a random string to duplicated column names to avoid errors. To avoid the name changing at each transform, AggTarget should either forbid duplicate column names and raise an error (as the Joiner does) or store the column names during fit_transform and apply them at the end of transform.

we should also check if the AggJoiner has the same issue

skrub._dataframe._pandas.aggregate seems to be the function that inserts the "index" column in what becomes AggTarget.y_, probably due to drop=False here

Thanks for pointing this out @Vincent-Maladiere. Just checked -- we have the same issue in the AggJoiner, as both are using _join_utils.left_join but dont forbid duplicate column names or store deduplicated names

>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "airportId": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
... })
>>> agg_joiner_1 = AggJoiner(
...     aux_table=aux,
...     key="airportId",
...     cols=["total_passengers"],
...     operations=["mean"],
... )

>>> agg_joiner_2 = AggJoiner(
...     aux_table=aux,
...     key="airportId",
...     cols=["total_passengers"],
...     operations=["mean"],
... )

>>> out = agg_joiner_1.fit_transform(main)
>>> out = agg_joiner_2.fit_transform(out)
>>> out
   airportId  airportName  total_passengers_mean  total_passengers_mean__skrub_3c2fc647__  
0          1    Paris CDG             103.333333                               103.333333  
1          2       NY JFK              80.000000                                80.000000

>>> agg_joiner_2.transform(agg_joiner_1.transform(main))
   airportId  airportName  total_passengers_mean  total_passengers_mean__skrub_18f488d3__
0          1    Paris CDG             103.333333                               103.333333  
1          2       NY JFK              80.000000                                80.000000