interpretml/interpret

Doing some treatment for missing values

p9anand opened this issue · 13 comments

Can we add some random values or treat them as "null" for missing data? That way we could avoid errors during data exploration using:
hist = ClassHistogram().explain_data(X_train, y_train, name='TrainData')

Hi @p9anand - thanks for bringing this up!

We're working on missing values at the moment.
The approach we're following is to treat missing as its own special value, and visualize it appropriately. We'd seen some strange problems in the past with imputation, and we're hoping to avoid them this way.

Exciting implementation, and promising algo when intelligibility is as important as accuracy!

Can you share how you are planning to implement missing value support?

The paper does not mention missing values. As I understand it, GA2M is not fitted with splines but rather with regression trees.

Then I could see two approaches:

  1. In XGBoost, missing values are supported by learning a default branch direction during training (unless using the gblinear booster). The same could be done for the univariate and bivariate trees.

  2. One-hot encode the missing value (e.g. with sklearn.impute.MissingIndicator) and learn a β coefficient for it; see the sketch below.
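
A minimal sketch of approach 2, assuming scikit-learn is available (the mean imputation for the original columns is just illustrative):

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

mask = MissingIndicator(features="all").fit_transform(X)  # boolean missing flags
X_filled = SimpleImputer().fit_transform(X)               # mean-impute the originals
X_aug = np.hstack([X_filled, mask])                       # originals + indicator columns
```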

Also interested in ideas how to put missing value info back into the visualizations.

I also suffer from the lack of missing value support in ExplainableBoostingRegressor. I deal with cross-species gene expression datasets with many NAs :_(

It is pretty straightforward to integrate missing values for categorical variables by simply adding another category. For continuous variables I tried a very simple workaround (a slight modification of the EBM code): encode unknown as "minimum value - 1", so that it's smaller than all regular values and treated separately during training. However, the EBM algorithm often seems to ignore these outlier values even when they occur very frequently, and missing ends up getting the same risk as the minimum value. Hence, it is not treated as the special value it is. Any ideas how to make this simple modification work for continuous values? Or is there any progress on a native implementation of missing values for the EBM algorithm?
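
For reference, a minimal sketch of my workaround (the function name is mine):

```python
import numpy as np

def encode_missing_as_sentinel(x):
    # Replace NaNs in a 1-D continuous feature with min(x) - 1,
    # so missing values sort below all observed values.
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmin(x) - 1.0, x)
```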

Thanks!

Hi @p9anand, @timvink, @antonkulaga, @stefanhgm -

Our latest release includes some work to handle missing values better. To enable this, though, you'd need to modify the code in ebm.py by changing every place that has "missing_data_allowed=False" to "missing_data_allowed=True", and then it should work. We didn't include this in the release because our graphing code still needs to be updated to handle missing values, but the underlying core framework should handle them now. The graphing code should still function, but it won't show the missing value bin that gets created. If you want to see the missing value score today, you would need to check the additive_terms_ field, which has the missing value score at index 0.
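
For example, a minimal sketch of reading those scores, assuming you've made the ebm.py change above and fitted a model (X_train/y_train are placeholders):

```python
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# For main-effect terms, the missing value bin sits at index 0.
for name, scores in zip(ebm.feature_names, ebm.additive_terms_):
    print(name, "-> missing bin score:", scores[0])
```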

The change needed to re-enable missing value support would be to do the opposite of:
f298c73

The current implementation puts all the missing values into their own bin on the left side. We do plan to improve this and implement the XGBoost method, where the missing values are merged into whichever side improves the gain most on each boosting step.

-InterpretML team

Hi.

Thanks for coming back to this issue and for your detailed explanation! It works as described for me and, unsurprisingly, produces very similar results to the workaround I mentioned earlier. Just as a clarification: as far as I can see, you create an extra bin at position [0] (1D) or [0,...]/[...,0] (2D) that can receive a non-zero weight during training even if there are no missing values? And is there any indicator in EBM that tells me whether a variable has missing values? Otherwise, I could also infer that from the data myself.

Implementation of the XGBoost method would be awesome. Is that something that will be accomplished in the next few weeks, or will it take more time? Just wondering for a current project. For visualization, we used the aforementioned workaround for missing values and a custom plotting implementation in a recent project, where we reserved 10% of the axes for a visually separated unknown bin (see the Appendix in https://bit.ly/39RcYex).

Hi @stefanhgm --

Thanks for sending us a link to your paper. It's great to see new research in the EBMs/GA2Ms visualization space!

You can find out if there were missing values in the dataset by looking at the field ebm.preprocessor_.col_bin_counts_[feature_index][0], which should contain the count of missing values observed.
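
For example (the feature index is illustrative):

```python
feature_index = 0  # any feature of interest
n_missing = ebm.preprocessor_.col_bin_counts_[feature_index][0]
if n_missing > 0:
    print(f"feature {feature_index} had {n_missing} missing values in training")
```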

Yes, the resulting models should look similar to how they would look if missing values were given an extreme outlier value, with the additional benefit that missing values are now guaranteed to get their own bin and not be merged with real data when there are too few of them. If you have a dataset with no missing values for a feature, then the resulting model should have logits of zero in the missing value bin. If you are seeing something other than zero, then we need to figure that out. Boosting will create an illusory value in the missing bin even if there is no data there, but we force the value to zero in post-processing.

For 1D, your representation is correct. To be a little clearer on how this works for pairs, if before you had two binary features and the following matrix for the pair in the model:
[0.1 0.2]
[0.3 0.4]

What you would now get with the missing values change is a 3rd missing bin on each dimension, so a 3x3 matrix that looks like this:

[0.0 0.0 0.0]
[0.0 0.1 0.2]
[0.0 0.3 0.4]

An interesting aspect is that there is now a bin for the case where both features are missing, at the [0, 0] location. This missing value handling method can extend to higher dimensions once we start supporting three-way and higher tensor interactions.
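
In code, the expansion above is just zero-padding each dimension, e.g. with numpy:

```python
import numpy as np

pair = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

# Prepend a zero row and a zero column: index 0 becomes the missing bin.
with_missing = np.pad(pair, ((1, 0), (1, 0)))
```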

IMHO, missing value handling is going to be one of the more interesting aspects of the EBM model class. One important aspect is that we'll likely be able to make reasonable predictions in scenarios where the training data has no missing values, but they occur during prediction. So, for instance, if you had a sensor of some kind that was working while training data was collected, but which failed at a later date, EBMs will still be able to use the other features available to make reasonable predictions. Each feature is independently mean centered, so if a feature is missing, a reasonable choice is to simply set its contribution to the expected value of 0. Of course, you could do even better by retraining the model if there is a change in a feature definition like this, but we expect that setting to zero will be a reasonable choice in many applications for handling unexpected changes in the data.
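
As a minimal sketch of that idea (illustrative pseudocode of the scoring logic, not the library's internals):

```python
def ebm_logit(intercept, term_scores, bin_indices):
    # bin_indices[i] is the bin index for feature i, or None if missing.
    logit = intercept
    for scores, b in zip(term_scores, bin_indices):
        if b is not None:
            logit += scores[b]
        # else: the feature contributes its mean-centered expectation, 0
    return logit
```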

The other interesting aspect is that EBMs should make it visually clear whether the missing values carry any information by virtue of being missing. If the missing values have a signal, as they often do in medicine where diagnostic tests are not ordered when they don't seem relevant, we should learn a signal from the fact that data is missing, and the resulting model should have a non-zero value in the missing bin. In cases where missing values were caused by a more random process, like flipping a coin, you would expect the value in the missing bin to be closer to zero. This will hold more reliably once we implement the XGBoost method of training, as it should be more balanced.

I would expect better missing value handling to take longer than a few weeks to make it into the package, in part because it's a new moving part and we want to give the new representation some time to solidify throughout our codebase before relying on it more. We also need to tackle getting missing values into our visualization system, so that they can be seen by our users.

-InterpretML team

@interpret-ml
Is there any news regarding support for missing values? Thanks

Hello @stefanhgm, @timvink, @p9anand, @antonkulaga, @candalfigomoro! Thank you so much for using EBM! I am Jay Wang, a research intern at the InterpretML team. We are developing a new visualization tool for EBM and recruiting participants for a user study (see #283 for more details).

We think you are a good fit for this paid user study! If you are interested, you can sign up with the link in #283. Let me know if you have any questions. Thank you!

BTW, I really enjoyed reading your paper "An Evaluation of the Doctor-Interpretability of Generalized Additive Models with Interactions" @stefanhgm!

Hi @xiaohk

Thank you very much! BTW, we also released the visualization code used in our experiments. It is based on Java Spring with a JS frontend and includes treatment of missing values: https://github.com/stefanhgm/EBM-Java-UI. We will also publish another paper very soon where we use EBMs to predict ICU readmission. It includes inspection and model editing by a team of doctors.

Your study sounds very cool, I just signed up. Thanks for the hint!

@stefanhgm Thanks! Your visualization tool looks very cool! I will send you an email for more details about the study.

@interpret-ml
Missing values are pretty common in real-world business data. Is the EBM able to handle them without having to impute them (like LightGBM does)? Imputation may be meaningless in some cases.

Missing values are now exposed in the latest v0.3.0 release without the need to modify the code. We still don't have UI to show them, but we give a warning to indicate this and explain how to access the missing value scores programmatically. We're leaving the issue open until we have UI to show them.
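
A minimal sketch of the new behavior, assuming v0.3.0 accepts NaNs directly as described (the toy data is illustrative):

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)  # trains with missing value bins; warns that the UI won't show them yet
```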