skrub-data/skrub

Potential performance issue: .to_dict method slow in pandas below 2.2

Closed this issue · 2 comments

Problem Description

Hello.
I found a performance degradation in the .to_dict method of pandas 1.5.3, and I noticed that parts of this repository depend on pandas 1.5.3. Several files, such as skrub/_table_vectorizer.py, use the affected API, and there may be more. I am not sure whether this pandas performance problem affects this repository. The problem is discussed on the pandas GitHub, including in #50990 and #54824.
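To check whether the slowdown is noticeable in a given environment, one option is to time `DataFrame.to_dict` directly and compare the result across pandas versions. The sketch below is only an illustration: the helper name `time_to_dict`, the synthetic frame, and the default sizes are assumptions, not code taken from skrub or from the linked pandas issues.

```python
import time

import pandas as pd


def time_to_dict(n_rows=10_000, n_cols=5, orient="records", repeats=3):
    """Return the best wall-clock time (in seconds) of DataFrame.to_dict.

    Hypothetical helper: builds a synthetic integer frame and keeps the
    fastest of `repeats` runs to reduce timing noise.
    """
    df = pd.DataFrame({f"col_{i}": range(n_rows) for i in range(n_cols)})
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        df.to_dict(orient=orient)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Run this under each pandas version of interest and compare the numbers.
    print(f"pandas {pd.__version__}: {time_to_dict():.4f} s")
```

Running the script once in an environment with pandas 1.5.3 and once with a recent release gives a rough comparison. The absolute numbers depend on the machine and on the shape and dtypes of the data.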

Feature Description

I would recommend considering an upgrade to pandas >= 2.2, or exploring other ways to optimize the performance.
Any other workarounds or solutions would be greatly appreciated.
Thank you!

Alternative Solutions

No response

Additional Context

No response

Hello, thanks for investigating and reporting this!
A lot of this code is likely to change due to (i) adding support for polars dataframes (#888) and (ii) refactoring the table vectorizer (#877).

In any case, 1.5.3 is a rather old version, so users can always update pandas and benefit from the fast to_dict in recent versions.
1.5.3 is the oldest supported version, but skrub also works with more recent versions of pandas (including the latest release). By default, pip and conda install the latest version.
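For users who want to confirm which side of the threshold their environment is on, a small version check could look like the sketch below. The helper name `pandas_at_least` is hypothetical, and the 2.2 cutoff is taken from the report above rather than verified independently.

```python
import pandas as pd


def pandas_at_least(major, minor, version=None):
    """Return True if `version` (default: the installed pandas) is >= major.minor.

    Hypothetical helper: only the first two dot-separated components are
    compared, so suffixes such as "rc0" in the third component are ignored.
    """
    version = pd.__version__ if version is None else version
    parts = version.split(".")
    return (int(parts[0]), int(parts[1])) >= (major, minor)


if __name__ == "__main__":
    if pandas_at_least(2, 2):
        print(f"pandas {pd.__version__}: at or above the reported 2.2 threshold")
    else:
        print(f"pandas {pd.__version__}: consider upgrading, e.g. pip install -U pandas")
```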

I think this can be closed, because a user experiencing bad performance due to this can just upgrade their pandas version (if I understand correctly). But if I misunderstood or am missing something, feel free to reopen, @TendouArisu.