skrub-data/skrub

AggJoiner raises exceptions when trying to join multiple tables at once

Closed this issue · 5 comments

Describe the bug

I am unable to use the AggJoiner to execute a join over multiple tables at the same time with different keys.

Given one main table and two aux tables (see example), error checking on the main_key variable prevents me from providing the proper keys.

If I use a list of strings:

main_key = ["key_1", "key_2"]

Then this is interpreted as joining over multiple keys on one table.

If I use a list of lists of strings,

main_key = [["key_1"], ["key_2"]]

then the function check_missing_columns in _agg_joiner.py raises an exception because lists are unhashable.
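The failure with a list of lists can be reproduced without skrub. The sketch below is a simplified stand-in for the check, built only around the `set(columns) - set(X.columns)` line visible in the traceback for exception 1; the function signature and the `available_columns` argument are assumptions made for illustration, not the library's actual code.

```python
def check_missing_columns(available_columns, columns, error_msg):
    """Simplified stand-in for the column check (illustrative only)."""
    # set() hashes every element, so an inner list raises TypeError here,
    # before any column name is compared.
    missing_cols = set(columns) - set(available_columns)
    if len(missing_cols) > 0:
        raise ValueError(error_msg)


available = ["key_1", "key_2"]

# A flat list of column names passes the check.
check_missing_columns(available, ["key_1", "key_2"], "missing columns")

# A list of lists fails while the set is being built.
try:
    check_missing_columns(available, [["key_1"], ["key_2"]], "missing columns")
except TypeError as exc:
    nested_error = str(exc)
    print(nested_error)
```

So the nested form never reaches the "missing columns" error path: it fails earlier, while the set is being built.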

Steps/Code to Reproduce

Setup

import polars as pl
from skrub import AggJoiner

main_table = pl.DataFrame(
    {
        "key_1": [1, 2, 3, 4],
        "key_2": [10, 20, 30, 40],
    }
)
aux_table_1 = pl.DataFrame(
    {
        "key_1": [3, 4],
        "key_2": [30, 40],
    }
)
aux_table_2 = pl.DataFrame(
    {
        "key_1": [1, 2],
        "key_2": [10, 20],
    }
)

join_tables = [aux_table_1, aux_table_2]
join_keys = [["key_1"], ["key_1"]]

First exception

main_keys_1 = [["key_1"], ["key_1"]]
aggjoiner = AggJoiner(
    aux_table=join_tables,
    aux_key=join_keys,
    main_key=main_keys_1,
)
joined_table = aggjoiner.fit_transform(main_table)

Second exception

main_keys_2 = ["key_1", "key_1"]
aggjoiner = AggJoiner(
    aux_table=join_tables,
    aux_key=join_keys,
    main_key=main_keys_2,
)
joined_table = aggjoiner.fit_transform(main_table)

Expected Results

Main table should be joined with aux_table_1 and aux_table_2.

Actual Results

Traceback for exception 1:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /Users/rcap/Projects/benchmark-join-suggestions/bug_skrub.py:9
      2 aggjoiner = AggJoiner(
      3     aux_table=join_tables,
      4     aux_key=join_keys,
      5     # TODO: write this properly
      6     main_key=main_keys,
      7 )
      8 # execute join between X and the candidates
----> 9 joined_table = aggjoiner.fit_transform(main_table)

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/sklearn/utils/_set_output.py:140, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    138 @wraps(f)
    139 def wrapped(self, X, *args, **kwargs):
--> 140     data_to_wrap = f(self, X, *args, **kwargs)
    141     if isinstance(data_to_wrap, tuple):
    142         # only wrap the first output for cross decomposition
    143         return (
    144             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    145             *data_to_wrap[1:],
    146         )

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/sklearn/base.py:878, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    874 # non-optimized default implementation; override when a better
    875 # method is possible for a given clustering algorithm
    876 if y is None:
    877     # fit method of arity 1 (unsupervised transformation)
--> 878     return self.fit(X, **fit_params).transform(X)
    879 else:
    880     # fit method of arity 2 (supervised transformation)
    881     return self.fit(X, y, **fit_params).transform(X)

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/skrub/_agg_joiner.py:196, in AggJoiner.fit(self, X, y)
    178 def fit(self, X, y=None):
    179     """Aggregate auxiliary tables based on the main keys.
    180 
    181     Parameters
   (...)
    194         Fitted :class:`AggJoiner` instance (self).
    195     """
--> 196     self.check_input(X)
    197     skrub_px, _ = get_df_namespace(*self.aux_table_)
    199     num_operations, categ_operations = split_num_categ_operations(self.operation_)

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/skrub/_agg_joiner.py:272, in AggJoiner.check_input(self, X)
    270 # Check main_key
    271 error_msg = f"main_key={self.main_key_!r} are not in {X.columns=!r}."
--> 272 check_missing_columns(X, self.main_key_, error_msg=error_msg)
    274 # Check length of table and aux_key
    275 if not isinstance(self.aux_table, (list, tuple)):

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/skrub/_agg_joiner.py:67, in check_missing_columns(X, columns, error_msg)
     52 def check_missing_columns(
     53     X,
...
---> 67     missing_cols = set(columns) - set(X.columns)
     68     if len(missing_cols) > 0:
     69         raise ValueError(error_msg)

TypeError: unhashable type: 'list'

Traceback for exception 2:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /Users/rcap/Projects/benchmark-join-suggestions/bug_skrub.py:7
      1 # %%
      2 aggjoiner = AggJoiner(
      3     aux_table=join_tables,
      4     aux_key=join_keys,
      5     main_key=main_keys_2,
      6 )
----> 7 joined_table = aggjoiner.fit_transform(main_table)

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/sklearn/utils/_set_output.py:140, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    138 @wraps(f)
    139 def wrapped(self, X, *args, **kwargs):
--> 140     data_to_wrap = f(self, X, *args, **kwargs)
    141     if isinstance(data_to_wrap, tuple):
    142         # only wrap the first output for cross decomposition
    143         return (
    144             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    145             *data_to_wrap[1:],
    146         )

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/sklearn/base.py:878, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    874 # non-optimized default implementation; override when a better
    875 # method is possible for a given clustering algorithm
    876 if y is None:
    877     # fit method of arity 1 (unsupervised transformation)
--> 878     return self.fit(X, **fit_params).transform(X)
    879 else:
    880     # fit method of arity 2 (supervised transformation)
    881     return self.fit(X, y, **fit_params).transform(X)

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/skrub/_agg_joiner.py:196, in AggJoiner.fit(self, X, y)
    178 def fit(self, X, y=None):
    179     """Aggregate auxiliary tables based on the main keys.
    180 
    181     Parameters
   (...)
    194         Fitted :class:`AggJoiner` instance (self).
    195     """
--> 196     self.check_input(X)
    197     skrub_px, _ = get_df_namespace(*self.aux_table_)
    199     num_operations, categ_operations = split_num_categ_operations(self.operation_)

File ~/opt/anaconda3/envs/bench/lib/python3.10/site-packages/skrub/_agg_joiner.py:335, in AggJoiner.check_input(self, X)
    332     check_missing_columns(table, cols, error_msg=error_msg)
    334     if len(aux_key) != len(self.main_key_):
--> 335         raise ValueError(
    336             "The number of keys to join must match, got "
    337             f"main_key={self.main_key_!r} and "
    338             f"{aux_key=!r} for the table at index {idx}."
    339         )
    341 self.aux_table_ = tables
    343 # Check tables and list of cols match

ValueError: The number of keys to join must match, got main_key=['key_1', 'key_1'] and aux_key=['key_1'] for the table at index 0.
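The second exception comes from a length comparison: the flat list `["key_1", "key_1"]` is read as two key columns on the main table, while each auxiliary table supplies a single key column. The sketch below reproduces that comparison in isolation; `check_key_lengths` is a hypothetical helper modelled on the lines shown in the traceback, not the library's own function.

```python
def check_key_lengths(main_key, aux_keys):
    """Simplified stand-in: compare each table's aux_key with main_key."""
    for idx, aux_key in enumerate(aux_keys):
        # One main_key list is shared by every auxiliary table, so each
        # aux_key must contain the same number of columns as main_key.
        if len(aux_key) != len(main_key):
            raise ValueError(
                "The number of keys to join must match, got "
                f"main_key={main_key!r} and "
                f"{aux_key=!r} for the table at index {idx}."
            )


try:
    check_key_lengths(["key_1", "key_1"], [["key_1"], ["key_1"]])
except ValueError as exc:
    length_error = str(exc)
    print(length_error)
```

Under this reading, neither form of `main_key` can express "one key per auxiliary table" with this version of the AggJoiner.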

Versions

System:
    python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:43:39) [Clang 11.1.0 ]
executable: /Users/rcap/opt/anaconda3/envs/bench/bin/python
   machine: macOS-14.5-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.24.3
        scipy: 1.12.0
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.1
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/rcap/opt/anaconda3/envs/bench/lib/libopenblasp-r0.3.27.dylib
        version: 0.3.27
threading_layer: openmp
   architecture: Nehalem

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/rcap/opt/anaconda3/envs/bench/lib/libomp.dylib
        version: None
0.1.1

Is this a case where the MultiAggJoiner is the right object? cc @TheooJ

If so, we should clarify the docstring of AggJoiner and raise a clean error when it is given multiple tables, pointing to the MultiAggJoiner.

I get this error:

TypeError: 'aux_table' must be a dataframe or the string 'X', got <class 'list'>. If you have more than one 'aux_table', use the MultiAggJoiner instead.
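For illustration, a guard that produces this kind of message could look like the sketch below. `check_single_aux_table` is a hypothetical name, and the sketch only rejects lists and tuples; the check on the main branch presumably also validates that the value is a dataframe or the string 'X', which is omitted here to keep the example dependency-free.

```python
def check_single_aux_table(aux_table):
    """Hypothetical guard: reject a collection of auxiliary tables."""
    if isinstance(aux_table, (list, tuple)):
        raise TypeError(
            "'aux_table' must be a dataframe or the string 'X', "
            f"got {type(aux_table)}. If you have more than one "
            "'aux_table', use the MultiAggJoiner instead."
        )
    return aux_table


try:
    # Placeholder strings stand in for the two auxiliary dataframes.
    check_single_aux_table(["aux_table_1", "aux_table_2"])
except TypeError as exc:
    guard_error = str(exc)
    print(guard_error)
```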

I was working from the version available on pip, so that's why the error message was not clear. Switching to the MultiAggJoiner from the main branch fixed the issue.

Yes, the AggJoiner available on PyPI is not the same as the one on the main branch -- but we made sure to include a useful message for that particular case. It'll be available in the next release!

Btw @rcap107, we should have a chat if you have a wishlist for the (Multi)AggJoiner!

Great, I think we can close this then, as it is fixed on the main branch. Thanks for checking that, @rcap107!