Example 08_join_aggregation broken
Vincent-Maladiere opened this issue · 4 comments
Describe the issue linked to the documentation
On both the stable and dev versions of the documentation, in the "Hyper-parameters tuning and cross validation" section of the 08_join_aggregation example, the scores reported by GridSearchCV are all nan.
This is because the estimator fails during cross-validation with the following error:
ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- index__skrub_f3c63e8c__
Feature names seen at fit time, yet now missing:
- index__skrub_634644a2__
It appears that a column name changes between fit and predict in one of the preprocessing steps of the pipeline, which makes _check_feature_names fail.
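To illustrate why a freshly generated suffix is enough to trigger this error, here is a minimal sketch. It does not use scikit-learn's actual check: `compare_feature_names` and `random_suffix_name` are hypothetical helpers that only imitate, respectively, a fit/predict name comparison and a deduplication step that draws a new random tag on each call.

```python
import uuid

import pandas as pd


def random_suffix_name(name):
    # Hypothetical stand-in for a deduplication step that appends
    # a fresh random tag every time it is called.
    return f"{name}__skrub_{uuid.uuid4().hex[:8]}__"


def compare_feature_names(fitted_names, new_names):
    # Simplified imitation of a fit/predict feature-name check
    # (not scikit-learn's implementation).
    unseen = [n for n in new_names if n not in fitted_names]
    missing = [n for n in fitted_names if n not in new_names]
    return unseen, missing


# The "same" column gets a different name at fit time and at predict time.
train = pd.DataFrame({"a": [1, 2], random_suffix_name("index"): [0, 1]})
test = pd.DataFrame({"a": [3, 4], random_suffix_name("index"): [0, 1]})

unseen, missing = compare_feature_names(list(train.columns), list(test.columns))
print("Feature names unseen at fit time:", unseen)
print("Feature names seen at fit time, yet now missing:", missing)
```

Each list contains one name, and the two names differ only by their random tag, which matches the shape of the error above.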
Suggest a potential alternative/fix
I need more time to understand which preprocessor led to this column name change.
Here is a minimal reproducer:
>>> import pandas as pd
>>> from skrub import AggTarget
>>> df = pd.DataFrame(dict(a=[1, 1, 2, 2], b=[10, 20, 30, 40]))
>>> y = pd.Series([1, 2, 3, 4], name="b")
>>> transformer_1 = AggTarget(main_key='a', operation='mean')
>>> transformer_2 = AggTarget(main_key='a', operation='mean')
>>> out = transformer_1.fit_transform(df, y)
>>> out = transformer_2.fit_transform(out, y)
>>> out
a b b_mean_target b_mean_target__skrub_38870b21__
0 1 10 1.5 1.5
1 1 20 1.5 1.5
2 2 30 3.5 3.5
3 2 40 3.5 3.5
Note that the name of the last column changes on every call to transform:
>>> transformer_2.transform(transformer_1.transform(df))
a b b_mean_target b_mean_target__skrub_6f32724b__
0 1 10 1.5 1.5
1 1 20 1.5 1.5
2 2 30 3.5 3.5
3 2 40 3.5 3.5
>>> transformer_2.transform(transformer_1.transform(df))
a b b_mean_target b_mean_target__skrub_585a8372__
0 1 10 1.5 1.5
1 1 20 1.5 1.5
2 2 30 3.5 3.5
3 2 40 3.5 3.5
This is due to AggTarget calling left_join in transform with duplicate column names in the dataframe it joins:
Line 442 in 7341c66
left_join is a stateless low-level function and it appends a random string to duplicated column names to avoid errors. To avoid the name changing at each transform, AggTarget should either forbid duplicate column names and raise an error (as the Joiner does) or store the column names during fit_transform and apply them at the end of transform.
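The second option (storing the names during fit_transform and re-applying them in transform) could be sketched as follows. This is not skrub's code: `stateless_left_join` is a hypothetical stand-in for a join that renames clashing columns with a random tag, and `StableNameAggregator` is a toy transformer written only for this illustration. It assumes transform receives the same input columns, in the same order, as fit_transform did.

```python
import uuid

import pandas as pd


def stateless_left_join(left, right, key):
    # Hypothetical stand-in for a low-level join: columns of `right` that
    # clash with `left` get a new random tag on every call.
    clashes = [c for c in right.columns if c in left.columns and c != key]
    tag = uuid.uuid4().hex[:8]
    right = right.rename(columns={c: f"{c}__skrub_{tag}__" for c in clashes})
    return left.merge(right, on=key, how="left")


class StableNameAggregator:
    # Toy transformer: joins the per-key mean of the target and reuses
    # the output column names chosen during fit_transform.

    def __init__(self, key):
        self.key = key

    def fit_transform(self, X, y):
        target_name = f"{y.name}_mean_target"
        self.agg_ = (
            y.groupby(X[self.key].to_numpy())
            .mean()
            .rename(target_name)
            .rename_axis(self.key)
            .reset_index()
        )
        out = stateless_left_join(X, self.agg_, self.key)
        # Remember the (possibly deduplicated) names picked at fit time.
        self.output_names_ = list(out.columns)
        return out

    def transform(self, X):
        out = stateless_left_join(X, self.agg_, self.key)
        # Assumption: same input columns and order as during fit, so the
        # stored names can be re-applied positionally.
        out.columns = self.output_names_
        return out


df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10, 20, 30, 40]})
y = pd.Series([1, 2, 3, 4], name="b")

step_1 = StableNameAggregator(key="a")
step_2 = StableNameAggregator(key="a")

fitted = step_2.fit_transform(step_1.fit_transform(df, y), y)
replayed = step_2.transform(step_1.transform(df))
print(list(fitted.columns) == list(replayed.columns))
```

The random tag is still drawn once, at fit time, so the deduplicated name remains unpredictable across fits, but it no longer changes between fit and transform.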
We should also check whether the AggJoiner has the same issue.
skrub._dataframe._pandas.aggregate seems to be the function that inserts the "index" column in what becomes AggTarget.y_, probably due to drop=False.
Thanks for pointing this out @Vincent-Maladiere. I just checked: we have the same issue in the AggJoiner, as both use _join_utils.left_join but neither forbids duplicate column names nor stores the deduplicated names:
>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
... "airportId": [1, 2],
... "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
... "flightId": range(1, 7),
... "airportId": [1, 1, 1, 2, 2, 2],
... "total_passengers": [90, 120, 100, 70, 80, 90],
... })
>>> agg_joiner_1 = AggJoiner(
... aux_table=aux,
... key="airportId",
... cols=["total_passengers"],
... operations=["mean"],
... )
>>> agg_joiner_2 = AggJoiner(
... aux_table=aux,
... key="airportId",
... cols=["total_passengers"],
... operations=["mean"],
... )
>>> out = agg_joiner_1.fit_transform(main)
>>> out = agg_joiner_2.fit_transform(out)
>>> out
airportId airportName total_passengers_mean total_passengers_mean__skrub_3c2fc647__
0 1 Paris CDG 103.333333 103.333333
1 2 NY JFK 80.000000 80.000000
>>> agg_joiner_2.transform(agg_joiner_1.transform(main))
airportId airportName total_passengers_mean total_passengers_mean__skrub_18f488d3__
0 1 Paris CDG 103.333333 103.333333
1 2 NY JFK 80.000000 80.000000
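The other option mentioned earlier, rejecting duplicate column names up front, could be sketched as below. `check_no_column_clash` is a hypothetical helper, not the check the Joiner actually performs; it only shows the kind of validation either transformer could run before joining.

```python
import pandas as pd


def check_no_column_clash(main, aux_columns, key):
    # Hypothetical validation: raise if joining `aux_columns` onto `main`
    # would duplicate a column name. The key is shared on purpose.
    clashes = sorted((set(main.columns) & set(aux_columns)) - {key})
    if clashes:
        raise ValueError(
            f"Columns already present in the main table: {clashes}. "
            "Rename or drop them before joining."
        )


# Output of a first aggregation step, fed into a second identical step.
main = pd.DataFrame(
    {"airportId": [1, 2], "total_passengers_mean": [103.3, 80.0]}
)

message = None
try:
    check_no_column_clash(
        main, ["airportId", "total_passengers_mean"], key="airportId"
    )
except ValueError as exc:
    message = str(exc)
    print(message)
```

Failing early like this trades convenience for predictability: chaining two identical aggregation steps becomes an explicit error instead of silently producing a column whose name differs between fit and transform.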