georgian-io/Multimodal-Toolkit

Bad dtype when categorical_encode_type is set to label

Closed this issue · 3 comments

While trying to tune my model, I found that somewhere in load_data the categorical features end up with the numpy object dtype, which can't be converted to a torch tensor in TorchTabularTextDataset.
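A minimal sketch of the symptom, not a confirmed reproduction: the exact cause inside load_data is not established in this thread. It assumes a label-encoded column that picked up a missing value, which is one common way a numpy array ends up with the object dtype. The helper to_int64 and the -1 sentinel are hypothetical and not part of the library.

```python
import numpy as np

# Hypothetical label-encoded column. The single None forces dtype=object.
encoded = np.array([0, 2, None, 1], dtype=object)
print(encoded.dtype)  # object

# torch.from_numpy(encoded) would raise a TypeError here, because
# object arrays have no matching torch dtype.


def to_int64(column, missing=-1):
    """Replace missing entries with a sentinel and cast to int64.

    Hypothetical helper: the sentinel value is an assumption.
    """
    cleaned = [missing if value is None else int(value) for value in column]
    return np.asarray(cleaned, dtype=np.int64)


fixed = to_int64(encoded)
print(fixed.dtype)  # int64
```

Casting to a concrete integer dtype before building the dataset is a workaround for the symptom only. If the object dtype comes from somewhere else, for example mixed string and integer values, the cast would fail loudly instead of silently producing a bad tensor.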

On another note, I think it's hard to productionize models trained with your library because of the preprocessing. It is applied either to all subsets of the data (train, validation, test) together or to each subset separately. The second case can and will cause errors, because each subset can contain different categories. I feel like data loading should be changed: CategoricalFeatures should be fitted on the training data only and then used just to transform the validation and test data, assigning unknown categories to a default "other" category. The fitted CategoricalFeatures should also be saved so it can be reused later, for example at inference time.
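To make the proposal concrete, here is a rough sketch of a fit-on-train, transform-elsewhere label encoder with a reserved index for unknown categories. The class LabelEncoderWithUnknown and its methods are hypothetical and are not the library's CategoricalFeatures API. It also assumes categories are strings, so the mapping survives a JSON round trip.

```python
import json


class LabelEncoderWithUnknown:
    """Hypothetical label encoder; index 0 is reserved for unseen categories."""

    UNKNOWN = 0

    def __init__(self):
        self.mapping = {}

    def fit(self, values):
        # Learn the mapping from training data only; known ids start at 1.
        categories = sorted(set(values))
        self.mapping = {cat: idx + 1 for idx, cat in enumerate(categories)}
        return self

    def transform(self, values):
        # Categories not seen during fit fall back to the UNKNOWN index.
        return [self.mapping.get(value, self.UNKNOWN) for value in values]

    def to_json(self):
        # Serialize the fitted mapping so it can be reused at inference time.
        return json.dumps(self.mapping)

    @classmethod
    def from_json(cls, payload):
        encoder = cls()
        encoder.mapping = json.loads(payload)
        return encoder


train = ["red", "green", "blue", "green"]
validation = ["green", "purple"]  # "purple" never appears in train

encoder = LabelEncoderWithUnknown().fit(train)
train_ids = encoder.transform(train)            # [3, 2, 1, 2]
validation_ids = encoder.transform(validation)  # [2, 0]

# Round trip through JSON, as one would when saving alongside the model.
restored = LabelEncoderWithUnknown.from_json(encoder.to_json())
```

Reserving a dedicated index for unknowns keeps the embedding or one-hot size fixed after training, so a model saved with the encoder can handle new categories in production without raising a lookup error.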

Also it would be awesome to be able to create pipelines like in transformers or sklearn for quick experimentation.

Hi @kondera, we'll fix this bug in the next release!

Thanks for the point about the preprocessing. I think that's a very valid point and it definitely makes sense. We'll look into what we can do about this. The pipeline bit is also something in our roadmap so we plan on adding it in at some point.

Hi @kondera, could you share some more information about this bug in load_data? I'm unable to reproduce the issue, so I'd appreciate your help!

Closing this issue due to lack of activity.