skrub-data/skrub

systematically handling column names and indexes of transformed dataframes


When we transform a dataframe, we want to make sure that in the output:

  • the column names are always the same (and unique)
  • if it is a pandas dataframe, the index is preserved
  • possibly other checks performed by CheckInputDataFrame
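As a rough illustration of the first two checks, here is a minimal sketch for the pandas case. The function name check_transform_output and its expected_columns argument are hypothetical and not part of skrub; the sketch only assumes that the transformer has recorded the column names it is supposed to produce.

```python
from collections import Counter

import pandas as pd


def check_transform_output(input_df, output_df, expected_columns):
    """Hypothetical helper: raise ValueError if an output invariant is broken."""
    columns = list(output_df.columns)
    # Column names must be unique.
    duplicates = [name for name, count in Counter(columns).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate column names in output: {duplicates}")
    # Column names must match the ones recorded earlier (e.g. during fit).
    if columns != list(expected_columns):
        raise ValueError(f"Expected columns {list(expected_columns)}, got {columns}")
    # For a pandas input, the output must carry the same index.
    if isinstance(input_df, pd.DataFrame) and not output_df.index.equals(
        input_df.index
    ):
        raise ValueError("The index of the input dataframe was not preserved")
    return output_df
```

A transformer could call something like this at the end of transform, passing the column names it stored during fit.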

See, for example, this comment.

I'm opening this now just so we don't forget about it.

Agreed!

It would also be useful to check that main and aux have the same dataframe type. For now, I believe only AggJoiner checks this, in the call X, self._aux_table = self._check_dataframes(X, self.aux_table), but we probably want it in the other joiners too.

We could use something like:

self._aux_check_input = CheckInputDataFrame()
self._aux_table = self._aux_check_input.fit_transform(self.aux_table)

self._main_check_input = CheckInputDataFrame()
main = self._main_check_input.fit_transform(main)

if self._main_check_input.module_name_ != self._aux_check_input.module_name_:
    ...

Currently:

  • the Joiner uses CheckInputDataFrame for main and aux, but doesn't check that their types match.
  • the InterpolationJoiner doesn't use CheckInputDataFrame. Note that here, main might not be known at the time of fitting.
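For the case where main is not available when fitting, one option might be to record the type of aux at fit time and defer the comparison until main is actually seen. The sketch below is only an illustration: DeferredTypeCheck and dataframe_module_name are hypothetical names, not skrub APIs, and it assumes that the top-level package of the object's type (e.g. "pandas" or "polars") is a good enough proxy for the dataframe type.

```python
def dataframe_module_name(df):
    """Return the top-level package defining the type of `df`."""
    return type(df).__module__.split(".")[0]


class DeferredTypeCheck:
    """Hypothetical helper: remember aux's type at fit, compare with main later."""

    def fit(self, aux_table):
        # Only aux is known here, so just record where its type comes from.
        self.aux_module_name_ = dataframe_module_name(aux_table)
        return self

    def check_main(self, main):
        # Called once main is available, e.g. at the start of transform.
        main_module_name = dataframe_module_name(main)
        if main_module_name != self.aux_module_name_:
            raise TypeError(
                f"main is a {main_module_name} object but aux is a "
                f"{self.aux_module_name_} object"
            )
        return main
```

The helper never imports a dataframe library itself, so it does not need to know in advance which libraries are installed.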