skrub-data/skrub

Faster alternative to GapEncoder


Problem Description

For encoding text and high-cardinality categories, we currently have MinHashEncoder, which only works when the downstream learner is based on decision trees, and GapEncoder, which gives high-quality representations but is very slow. It would be good to have something similar to the GapEncoder but faster, perhaps based on an SVD or scikit-learn's NMF.

Feature Description

An encoder that works similarly to GapEncoder but is faster, possibly at the cost of less interpretable topics or slightly reduced prediction performance.
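
To make the idea concrete, here is a rough sketch of what such an encoder could look like using only existing scikit-learn pieces: character n-gram TF-IDF followed by a truncated SVD. This is an assumption about one possible design, not a proposed skrub API; the helper name `make_ngram_svd_encoder`, the n-gram range, and the toy data are all made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def make_ngram_svd_encoder(n_components=3, random_state=0):
    """Hypothetical helper: char n-gram TF-IDF followed by truncated SVD.

    Not part of skrub; only a sketch of the kind of encoder discussed here.
    """
    return make_pipeline(
        # Character n-grams (within word boundaries) so that typos and
        # morphological variants share most of their features.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        # Low-rank projection of the sparse n-gram matrix to dense features.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )


# Toy high-cardinality column (made-up values).
job_titles = [
    "senior data engineer",
    "data engineer",
    "police officer",
    "police sergeant",
    "fire fighter",
    "firefighter ii",
]

encoder = make_ngram_svd_encoder(n_components=3)
embeddings = encoder.fit_transform(job_titles)
print(embeddings.shape)

# Unseen categories can be encoded with the fitted pipeline.
new_embeddings = encoder.transform(["junior data engineer"])
```

Swapping `TruncatedSVD` for scikit-learn's `NMF` in the same pipeline would give non-negative components, which may be closer to the GapEncoder's topics, likely at a higher fitting cost. Unlike GapEncoder, the SVD components can be negative and are not expected to be directly interpretable.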

related: #139