pyensemble/wildwood

Encoder improvements

Issue closed · 1 comment

Here are a few things to be done in the Encoder:

Very important stuff

  • Add code to handle numpy arrays (for now only pandas dataframes are handled)
  • Treat values that are neither categorical nor numerical as categorical
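
A rough sketch of what these two items could look like for numpy input. This is not the actual Encoder code: `column_is_numerical` and `encode_as_categorical` are hypothetical helper names, and the rule "anything that is not a real number becomes a category" (booleans included) is an assumption.

```python
import numbers

import numpy as np


def column_is_numerical(col):
    """Return True if a 1D numpy array only holds real numbers."""
    # Native integer, unsigned integer and float dtypes are numerical.
    if col.dtype.kind in "iuf":
        return True
    # Object columns need an element-wise check (assumption: bools are
    # treated as categories, not as numbers).
    if col.dtype.kind == "O":
        return all(
            isinstance(v, numbers.Real) and not isinstance(v, bool) for v in col
        )
    # Strings, bools, datetimes, etc. fall back to categorical.
    return False


def encode_as_categorical(col):
    """Map each distinct value (compared as a string) to an integer code."""
    categories, codes = np.unique(col.astype(str), return_inverse=True)
    return categories, codes


# Mixed-type input, as one would get from an object-dtype numpy array.
X = np.array(
    [[1.5, "a", True], [2.0, "b", False], [0.5, "a", True]], dtype=object
)
is_categorical = np.array(
    [not column_is_numerical(X[:, j]) for j in range(X.shape[1])]
)
categories, codes = encode_as_categorical(X[:, 1])
```

Comparing values as strings keeps the sketch short, but it would merge `1` and `"1"` into one category, so a real implementation may want something stricter.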

Important stuff

  • Put back all numba signatures
  • Check is_categorical, size, dtype, etc.
  • Check that "categories" are for the same columns as the ones passed
  • Unit tests for non-numerical non-categorical columns
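
The checks above could be sketched as follows. The function names are hypothetical, and the sketch assumes that `is_categorical` is a boolean mask with one entry per feature and that `categories` is a dict keyed by column index, which may not match the actual Encoder attributes.

```python
import numpy as np


def check_is_categorical(is_categorical, n_features):
    """Validate the dtype and size of the categorical mask."""
    is_categorical = np.asarray(is_categorical)
    if is_categorical.dtype != np.bool_:
        raise ValueError(
            f"is_categorical must be boolean, got dtype {is_categorical.dtype}"
        )
    if is_categorical.shape != (n_features,):
        raise ValueError(
            f"is_categorical must have shape ({n_features},), "
            f"got {is_categorical.shape}"
        )
    return is_categorical


def check_categories(categories, is_categorical):
    """Check that categories are given exactly for the categorical columns."""
    expected = set(np.flatnonzero(is_categorical).tolist())
    given = set(categories)
    if given != expected:
        raise ValueError(
            f"categories given for columns {sorted(given)}, "
            f"expected columns {sorted(expected)}"
        )


mask = check_is_categorical([True, False, True], n_features=3)
check_categories({0: ["a", "b"], 2: ["x", "y"]}, mask)  # passes silently
```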

Mild stuff

  • Finish all docstrings
  • Tests are missing, such as for n_features...
  • Check that X has the correct dtype
  • Also keep column and index information to rebuild the exact same

And we could do this in another PR:

  • Fit and transform in parallel over columns
  • Use bitsets for known categories?
  • Test for categories with too few modalities
  • Direct numba code for this by testing the -1 directly?
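
For the bitset idea, a minimal pure-numpy sketch, assuming categories are already encoded as small non-negative integers. `make_bitset` and `in_bitset` are hypothetical names and the default capacity of 256 codes is an arbitrary choice. It is written with plain loops and integer operations so that it could plausibly be compiled with numba later, but that has not been tried here.

```python
import numpy as np


def make_bitset(known_codes, n_bits=256):
    """Pack a collection of non-negative integer codes into uint32 words."""
    bitset = np.zeros((n_bits + 31) // 32, dtype=np.uint32)
    for code in known_codes:
        # Word index is code // 32, bit position inside the word is code % 32.
        bitset[code // 32] |= np.uint32(1) << np.uint32(code % 32)
    return bitset


def in_bitset(bitset, code):
    """Return True if `code` was stored in the bitset."""
    # Negative or out-of-range codes are never known.
    if code < 0 or code >= 32 * bitset.size:
        return False
    return bool((bitset[code // 32] >> np.uint32(code % 32)) & np.uint32(1))


known = make_bitset([0, 3, 40])
```

With this layout a membership test costs one shift and one mask, and a negative code is rejected before the bitset is even read.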