GAMChanger fails to load data in some cases
aaronsnoswell opened this issue · 9 comments
I've found a bug where GAMChanger sometimes doesn't populate the 'metrics' / 'feature' / 'history' panel. It seems that when this happens, the GAMChanger interface has failed to load the validation samples, because the status bar says "0/0 validation samples selected".
This seems to occur sometimes based on the data that is provided, and might have something to do with missing data points, but I'm struggling to figure out exactly what the cause is.
Below is the smallest reproducing example I can come up with.
See following comment for a better MWE.
import pandas as pd
import gamchanger as gc
from interpret.glassbox import ExplainableBoostingRegressor
# Works
X = pd.read_csv('demo-X-succeed.csv')
y = pd.read_csv('demo-y-succeed.csv')['OrderedFractionOfEstate']
# Doesn't work
#X = pd.read_csv('demo-X-fail.csv')
#y = pd.read_csv('demo-y-fail.csv')['OrderedFractionOfEstate']
ebm = ExplainableBoostingRegressor(interactions=False)
ebm.fit(X, y)
gc.visualize(ebm, X, y)
I've attached the CSV files, which differ only in that the 'succeed' files have a single extra data point. That is, when loading 'demo-[X|y]-fail.csv' the GAMChanger interface loads, but the side panel doesn't populate (unexpected behaviour). When loading 'demo-[X|y]-succeed.csv', the GAMChanger interface loads and the side panel populates the metrics as expected.
demo-X-fail.csv
demo-X-succeed.csv
demo-y-fail.csv
demo-y-succeed.csv
I have produced a more compact MWE:
import numpy as np
import pandas as pd
import gamchanger as gc
from interpret.glassbox import ExplainableBoostingRegressor
size = 5
x1 = np.linspace(0, 10, size)
y = -1.0 * x1.copy() + 3.0
# Introduce missing data
x1[1] = np.nan
x1[2] = np.nan
# With only two missing datapoints, the GAMChanger interface loads fine
# If we introduce a third missing feature value by un-commenting the below
# line, the validation data fails to load
#x1[3] = np.nan
df = pd.DataFrame(
data={
'x1': x1,
'y' : y
}
)
X = df[['x1']]
y = df['y']
print(df)
# Train model
ebm = ExplainableBoostingRegressor(interactions=False)
ebm.fit(X, y)
gc.visualize(ebm, X, y)
...update... Based on the above MWE, I have been able to narrow the problem down to an uncaught JavaScript error in the Firefox JS console, which I believe is coming from the variable messenger_js_base64 at gamchanger.py:528.
For the failing case, this JavaScript (before base64 encoding) looks like this:
(function() {
let data = {
"model": {
"intercept": -0.849715269828704,
"isClassifier": false,
"features": [
{
"name": "x1",
"type": "continuous",
"importance": 0.14849663043758726,
"additive": [-0.1856, -0.1856],
"error": [0.7972, 0.7972],
"id": [0],
"count": [1, 1],
"binEdge": [0.0, 5.0, 10.0],
"histEdge": [0.0, 10.0],
"histCount": [2]
}
],
"labelEncoder": {},
"scoreRange": [-0.9828, 0.6905]
},
"sample": {
"featureNames": ["x1"],
"featureTypes": ["continuous"],
"samples": [[0.0], [NaN], [NaN], [NaN], [10.0]],
"labels": [3.0, 0.5, -2.0, -4.5, -7.0]
}
};
let event = new Event('gamchangerData');
event.data = data;
console.log('before');
console.log(data);
document.dispatchEvent(event);
}())
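One detail in the payload above is the bare NaN tokens in "samples". The sketch below shows how such tokens can arise on the Python side, assuming the payload is serialized with the standard json module (an assumption; this has not been confirmed against gamchanger.py). By default json.dumps writes float('nan') as NaN, which a JavaScript literal accepts but which is not valid JSON. The None substitution at the end is only an illustrative workaround, not something GAMChanger is known to do.

```python
import json
import math

# Sample rows shaped like the "samples" field in the payload above.
samples = [[0.0], [math.nan], [math.nan], [math.nan], [10.0]]

# Default behaviour: NaN is written as the bare token NaN (not valid JSON).
default_text = json.dumps(samples)
print(default_text)

# Strict mode: json.dumps refuses to serialize NaN and raises ValueError.
try:
    json.dumps(samples, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False
print(strict_ok)

# Illustrative workaround (an assumption, not GAMChanger's behaviour):
# replace NaN with None so that it serializes as null.
cleaned = [
    [None if isinstance(v, float) and math.isnan(v) else v for v in row]
    for row in samples
]
cleaned_text = json.dumps(cleaned)
print(cleaned_text)
```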
Following the rabbit trail from the gamchangerData event down, I can see that this event is intercepted at GAM.svelte:663, which calls initDataLoaded, which calls initGAMView then initSidebar.
Of these two functions, initSidebar is the only one that uses a Promise (which is mentioned in the JS error), so perhaps the error originates in initSidebar, at line 447 in ec85c7a.
At this point, my limited knowledge of TypeScript and WASM is stopping me from investigating this bug further. I suspect the issue is coming from initGAMView or initSidebar, but without the ability to debug and iterate on a version of GAMChanger that is neither minified nor base64 encoded, I can't look into this more.
I would very much appreciate help from the devs to track down this bug! Presently, this is preventing me from using GAMChanger with my application (predicting court case outcomes).
Wow @aaronsnoswell thank you so much for your detailed report and effort in debugging this issue!
I tried to reproduce this error using your example, but I got a ValueError when fitting an EBM model with missing values. I believe EBM does not support missing values yet? My interpret version is 0.2.7.
x1 y
0 0.0 3.0
1 NaN 0.5
2 NaN -2.0
3 7.5 -4.5
4 10.0 -7.0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-13a640a7d74a> in <module>
32 # Train model
33 ebm = ExplainableBoostingRegressor(interactions=False)
---> 34 ebm.fit(X, y)
~/miniconda3/envs/gam/lib/python3.7/site-packages/interpret/glassbox/ebm/ebm.py in fit(self, X, y, sample_weight)
822 # AND add some tests for the X.dim == 1 scenario
823
--> 824 # TODO PK write an efficient striping converter for X that replaces unify_data for EBMs
825 # algorithm: grap N columns and convert them to rows then process those by sending them to C
826
~/miniconda3/envs/gam/lib/python3.7/site-packages/interpret/utils/all.py in unify_data(data, labels, feature_names, feature_types, missing_data_allowed)
340 msg = "Missing values are currently not supported."
341 log.error(msg)
--> 342 raise ValueError(msg)
343
344 return new_data, new_labels, new_feature_names, new_feature_types
ValueError: Missing values are currently not supported.
Hi @xiaohk thanks for getting back to me!
The latest versions of Interpret have experimental support for missing values - I forgot to mention that I am using this experimental code. To enable it, you need to change a few places in the interpret source code. See this comment on interpretml/interpret#18 for the details.
So for instance, I checked pip show interpret to get the install location, then opened ebm.py and changed all instances of missing_data_allowed=False to missing_data_allowed=True.
After doing that, the example should work.
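The manual edit described above could also be scripted. This is a rough sketch, not an official interpret mechanism: patch_source and patch_module_file are hypothetical helpers, and the module path interpret.glassbox.ebm.ebm is taken from the traceback earlier in this thread. It rewrites an installed file in place, so it keeps a .bak copy next to it.

```python
import importlib.util
from pathlib import Path

OLD = "missing_data_allowed=False"
NEW = "missing_data_allowed=True"


def patch_source(text):
    """Return the patched text and the number of replacements made."""
    return text.replace(OLD, NEW), text.count(OLD)


def patch_module_file(module_name):
    """Patch an installed module's source file in place, keeping a .bak copy."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        raise ImportError(f"cannot locate source for {module_name}")
    path = Path(spec.origin)
    original = path.read_text(encoding="utf-8")
    patched, count = patch_source(original)
    if count:
        # Save the untouched source before overwriting it.
        path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
        path.write_text(patched, encoding="utf-8")
    return count


# Usage (modifies the installed package, so run it deliberately):
# patch_module_file("interpret.glassbox.ebm.ebm")
```

Restart the Python session or notebook kernel afterwards so the patched module is re-imported.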
Thanks again!
I see, thanks!
EBM's experimental support for missing values introduces a separate additive_scores[0] that is only used when computing the prediction score for missing values. To fully support this, GAM Changer has to visualize the missing value "bin" in the GAM Canvas; I will leave that for future work. For now, GAM Changer will remove all rows in X that contain any missing values in get_sample_data().
fb7ba18#diff-de8698f459a11697fd2d6614444871f69e802ae1af354cd35aba32e62e6698bbR267-R277
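A minimal sketch of that interim behaviour, assuming pandas inputs like those in the MWE earlier in this thread. drop_rows_with_missing is a hypothetical helper written for illustration; it is not the actual code in get_sample_data().

```python
import numpy as np
import pandas as pd


def drop_rows_with_missing(X, y):
    """Keep only the rows of X (and the matching entries of y) with no missing values."""
    keep = X.notna().all(axis=1)
    return X[keep], y[keep]


# Same shape as the failing MWE: three of the five feature values are missing.
x1 = np.linspace(0, 10, 5)
labels = -1.0 * x1 + 3.0
x1[1:4] = np.nan

X = pd.DataFrame({"x1": x1})
y = pd.Series(labels, name="y")

X_clean, y_clean = drop_rows_with_missing(X, y)
print(X_clean["x1"].tolist())
print(y_clean.tolist())
```

With the failing MWE this leaves only the first and last rows as validation samples.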
I will close this issue for now. Let me know if it doesn't work for you @aaronsnoswell. Thanks for reaching out to me!
Thanks for looking into this, @xiaohk!
fb7ba18 seems like a good patch for now. Dropping all rows with missing values is pretty rough for users with real-world data though in the longer run :D
I'd be happy to take a stab at adding proper support for missing values if you can provide a little guidance for me. E.g. could you draw a sketch / doodle of what the GAMChanger interface should look like to show the missing value bin (where this would go in the interface?). Also, is there any documentation about setting up a development environment for GAMChanger?
Thank you so much for your interest! I believe supporting missing values will be super helpful. Adding this feature might sound straightforward, but I am sure it would require A LOT of work. 😅
Some high-level steps:
- Support missing values in the EBM inference in WebAssembly
  - Need to handle continuous features, categorical features, and interaction terms (implementations are different)
- Visualize the missing value
  - Continuous feature: a separate dot on the line chart
  - Categorical feature: a separate bin in the bar chart
  - Cont X Cont interaction: a new row and a new column in the matrix
  - Cont X Cat interaction: many new separate bars (when NA happens in cont) or a new bar (when NA happens in cat)
  - Cat X Cat interaction: a new row/column of dots
- Interaction with the missing value
  - Integrate editing tools to support missing values: e.g., align/interpolate/monotonicity do not really make sense.
  - We need to specially handle/prevent users from selecting regular bins and the NA bin all at once
- Integration with other views
  - Feature panel
  - History panel: new event logging when editing missing values
  - Footer: new event name when editing missing values
- Loading .gamchanger files
  - Need to load missing-value-related parameters
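To make the first step in the list above more concrete, here is a rough Python sketch of a per-feature score lookup with a dedicated slot for missing values, for a single continuous feature. It is illustrative only: the real inference code is WebAssembly, score_continuous is a hypothetical helper, and the additive and missing_score values are made up. Only the bin edges are taken from the payload earlier in this thread, and EBM's actual binning conventions may differ.

```python
import math
from bisect import bisect_right


def score_continuous(value, bin_edges, additive, missing_score):
    """Look up one feature's contribution, with a separate slot for missing values."""
    if value is None or math.isnan(value):
        return missing_score
    # bin_edges has len(additive) + 1 entries; bin i covers [edge[i], edge[i + 1]).
    idx = bisect_right(bin_edges, value) - 1
    # Clamp out-of-range values (and the right-most edge) into the first/last bin.
    idx = min(max(idx, 0), len(additive) - 1)
    return additive[idx]


bin_edges = [0.0, 5.0, 10.0]  # "binEdge" from the payload above
additive = [-0.5, 0.5]        # hypothetical per-bin scores
missing_score = 0.25          # hypothetical score for the missing-value bin

for v in [0.0, 7.5, 10.0, float("nan")]:
    print(v, score_continuous(v, bin_edges, additive, missing_score))
```

Categorical features and interaction terms would each need their own lookup, which is why the implementations differ.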
To set up a development environment for GAM Changer:
git clone git@github.com:interpretml/gam-changer.git
cd gam-changer
# Install the dependencies
npm install
# Start a development server
npm run dev
You might have noticed that the EBM inference and isotonic regression WebAssembly code are shipped as binaries in this repo. Their source code is at xiaohk/ebm.js and xiaohk/isotonic.js, respectively.
If you are interested, I am happy to provide any sketches, feedback, and guides that can help you! It would be a hard and rewarding contribution to GAM Changer!
Wow :) That does sound like a lot of work.
A first point: interpretml doesn't currently support missing values for continuous features, and I don't believe they plan to. That would potentially reduce some of the interactions you mention above and simplify the workload.
Perhaps a good starting point is to figure out how stable the interpretml missing value support is. Assuming I can rely on it not changing too much, I could potentially bite off part of this work list in a new branch to get the ball rolling. I will inquire over there and report back.
P.S. Thanks for the dev environment instructions.