skrub-data/skrub

Drop numpy array input support for `TableVectorizer`


Problem Description

As suggested in https://github.com/skrub-data/skrub/pull/786/files#r1376567227.

TableVectorizer currently accepts both dataframes and numpy arrays as inputs. It outputs numpy arrays.
We suggest dropping numpy array input support for several reasons:

  • Data scientists mainly work with dataframes. They seldom manipulate numpy arrays within models.
  • TableVectorizer is designed to dispatch encoders based on dtypes. As a numpy array has a single dtype for all its columns, supporting them sends the wrong message and defeats the purpose of the TableVectorizer.
  • Handling both dataframe and numpy array inputs obfuscates the logic.
  • Looking ahead, it is a good step toward using only dataframe operations within TableVectorizer, avoiding numpy conversions that might make copies and break laziness.
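
To illustrate the dtype argument, here is a minimal sketch using only pandas and numpy (it does not involve TableVectorizer itself). The column names and values are made up for the example. It shows that a dataframe keeps one dtype per column, while converting the same data to a numpy array collapses everything into a single dtype, which leaves nothing for dtype-based dispatch to work with.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column and one text column.
df = pd.DataFrame(
    {
        "age": [31, 45],
        "city": ["Paris", "Rome"],
    }
)

# A dataframe stores a dtype per column, so the two columns can be
# told apart and routed to different encoders.
print(df.dtypes)

# Converting to a numpy array forces a single dtype for the whole
# table; with mixed columns this falls back to the generic object dtype.
arr = df.to_numpy()
print(arr.dtype)
```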

Feature Description

Raise an error when a numpy array is passed to the TableVectorizer.
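
As a rough sketch of the requested behavior, the check could look like the following. The helper name `reject_numpy_input`, the choice of `TypeError`, and the error message are all assumptions made for illustration; this is not skrub's actual implementation.

```python
import numpy as np
import pandas as pd


def reject_numpy_input(X):
    """Hypothetical input check: refuse numpy arrays, pass anything else through."""
    if isinstance(X, np.ndarray):
        raise TypeError(
            "Expected a dataframe, got a numpy array. "
            "Wrap the data in a dataframe before passing it."
        )
    return X


# A dataframe passes the check unchanged.
df = pd.DataFrame({"a": [1, 2]})
checked = reject_numpy_input(df)

# A numpy array is rejected.
try:
    reject_numpy_input(np.array([[1, 2], [3, 4]]))
    numpy_rejected = False
except TypeError:
    numpy_rejected = True
```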

Alternative Solutions

No response

Additional Context

No response

Closing because this has been addressed in #902 with CheckInputDataFrame.