georgian-io/Multimodal-Toolkit

Imputation of numerical data

kopant opened this issue · 1 comment

Since you mentioned you're considering enhancing load_data(), it might also be worth exposing different methods for imputing missing numeric data to the user. Currently data_utils.load_num_feats() defaults to median imputation, which can be a poor choice when the data is missing because of real differences in the data-generating process (i.e., the NULL rows followed a different process than the non-NULL rows and are meaningfully distinct from them). In that case, one might instead want to encode the missing data, prior to modeling, with a distinct value that lies outside the non-NULL distribution.
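To make the suggestion concrete, here is a rough sketch of what a selectable imputation strategy could look like, assuming the numeric features are held in a pandas DataFrame. The helper name impute_numeric and its strategy and fill_value parameters are hypothetical and are not part of the existing Multimodal-Toolkit API.

```python
import numpy as np
import pandas as pd


def impute_numeric(df, cols, strategy="median", fill_value=-1.0):
    """Hypothetical helper (not Multimodal-Toolkit API) for filling missing numerics.

    strategy:
      "median"    - replace NaN with the column median
      "constant"  - replace NaN with fill_value, a sentinel the caller picks
                    so that it lies outside the observed range
      "indicator" - median-fill, plus one 0/1 column per feature marking
                    which rows were originally missing
    """
    out = df[cols].astype(float)  # astype returns a copy; df is not modified
    if strategy == "median":
        return out.fillna(out.median())
    if strategy == "constant":
        return out.fillna(fill_value)
    if strategy == "indicator":
        flags = out.isna().astype(float).add_suffix("_was_missing")
        return pd.concat([out.fillna(out.median()), flags], axis=1)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy data, made up for illustration only.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0], "income": [np.nan, 50.0, 70.0]})
median_filled = impute_numeric(df, ["age", "income"], "median")
constant_filled = impute_numeric(df, ["age", "income"], "constant", fill_value=-1.0)
flagged = impute_numeric(df, ["age", "income"], "indicator")
```

The "constant" and "indicator" options both let a downstream model tell missing rows apart from observed ones, which median imputation alone does not. A suitable sentinel depends on the feature's range, so it would probably need to be user-supplied.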

That's a good idea, thanks! We'll incorporate that when doing the enhancement.