pyensemble/wildwood

Encoder improvements

Closed this issue 3 years ago · 1 comment

stephanegaiffas commented 3 years ago

Here are a few things to be done in the Encoder:

Very important stuff

Add code to handle numpy arrays (for now, only pandas dataframes are handled)
Treat columns that are neither categorical nor numerical as categorical
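
A minimal sketch of how these two items could fit together, assuming the input is first normalized to a pandas dataframe. The helper name `as_dataframe_with_categories` is hypothetical and is not part of wildwood:

```python
import numpy as np
import pandas as pd


def as_dataframe_with_categories(X):
    """Hypothetical helper: accept a numpy array or a dataframe and
    convert non-numerical, non-categorical columns to categorical."""
    if isinstance(X, np.ndarray):
        if X.ndim != 2:
            raise ValueError("X must be 2-dimensional")
        # For object arrays, infer_objects() recovers a numerical dtype
        # column by column where that is possible
        df = pd.DataFrame(X).infer_objects()
    elif isinstance(X, pd.DataFrame):
        df = X.copy()
    else:
        raise TypeError("X must be a numpy array or a pandas dataframe")

    for name in df.columns:
        dtype = df[name].dtype
        is_categorical = isinstance(dtype, pd.CategoricalDtype)
        is_numerical = pd.api.types.is_numeric_dtype(dtype)
        if not is_categorical and not is_numerical:
            # Strings, dates, mixed objects, etc. are treated as categories
            df[name] = df[name].astype("category")
    return df
```

With this approach, a numpy array of dtype `object` that mixes floats and strings would end up with one numerical column and one categorical column, and the rest of the Encoder could keep working on dataframes only.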

Important stuff

Put back all numba signatures
Check is_categorical, size, dtype, etc.
Check that "categories" are for the same columns as the ones passed
Unit tests for non-numerical non-categorical columns
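
The check on "categories" could look something like the sketch below. It assumes that `categories` is a dict mapping a column name to its list of known categories; that layout and the function name `check_categories` are assumptions made for illustration, not the Encoder's actual API:

```python
import pandas as pd


def check_categories(df, categories):
    """Hypothetical check: the keys of `categories` must be exactly
    the categorical columns of `df`."""
    categorical_columns = {
        name
        for name in df.columns
        if isinstance(df[name].dtype, pd.CategoricalDtype)
    }
    missing = categorical_columns - set(categories)
    unexpected = set(categories) - categorical_columns
    if missing or unexpected:
        raise ValueError(
            "categories do not match the categorical columns: "
            f"missing={sorted(missing, key=str)}, "
            f"unexpected={sorted(unexpected, key=str)}"
        )
```

A unit test can then build a small dataframe with one categorical and one numerical column and assert that a mismatched `categories` argument raises `ValueError`.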

Mild stuff

Finish all docstrings
tests are missing, such as for n_features...
check that X has the correct dtype
keep also column and index information to rebuild the exact same

And we could do this in another PR:

Fit and transform in parallel over columns
Use bitsets for known categories?
Test for categories with too few modalities
Write direct numba code for this by testing for -1 directly?
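
To make the bitset idea concrete, here is a rough sketch in plain NumPy (not numba), assuming categories are already encoded as non-negative integer codes and that -1 stands for a missing or unknown value, as in pandas `cat.codes`. The function names are hypothetical:

```python
import numpy as np


def make_bitset(codes, n_categories):
    """Store the set of known category codes as bits in 32-bit words."""
    n_words = (n_categories + 31) // 32
    bitset = np.zeros(n_words, dtype=np.uint32)
    for code in codes:
        # Word index is code // 32, bit position inside the word is code % 32
        bitset[code // 32] |= np.uint32(1) << np.uint32(code % 32)
    return bitset


def in_bitset(bitset, code):
    """Return True if `code` is a known category."""
    # Negative codes (such as -1) and codes past the end are never known
    if code < 0 or code >= 32 * bitset.size:
        return False
    word = bitset[code // 32]
    return bool((word >> np.uint32(code % 32)) & np.uint32(1))
```

Testing `code < 0` first is the kind of direct check on -1 mentioned in the last item: unknown values are rejected before any lookup in the bitset.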

stephanegaiffas commented 3 years ago

Done in #95