Tabular perturbation is inaccurate when using discretization
TobiasGoerke opened this issue · 10 comments
The default tabular perturbation function currently takes a random instance and replaces the perturbed instance's non-fixed feature values with the corresponding values of that other instance.
The fixed values remain unchanged.
This is inaccurate when using discretization: even the fixed values should randomly change within their discretized class.
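To illustrate, a minimal sketch of the current behaviour (made-up names, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_current(instance, data, fixed_idx):
    """Copy a randomly drawn instance and re-insert the exact original
    values for all fixed (anchored) features."""
    perturbed = data[rng.integers(len(data))].copy()
    perturbed[fixed_idx] = instance[fixed_idx]  # fixed values never change
    return perturbed
```

With discretization, a fixed feature is really only constrained to its bin, yet this always passes back the single original value, so the model never sees any other value from that bin.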
I have not fully thought this through, but yes: it appears that there is more precision to be gained here.
This issue can also cause problems rather than just inaccuracies. The following situation just occurred to me:
Assume there is a decision tree that splits feature A at value 5. The instance to be explained has a value of 6, and feature A is discretized into a class ranging from 4 to 6. Now, each time feature A is fixed, the value 6 is passed to the model. Hence, we'll get a high precision even though we are ignoring a decision boundary that lies inside the class.
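A quick toy version of that situation (the uniform sampling in the second block is only an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
model = lambda a: int(a > 5)          # toy model: decision boundary at A = 5
instance_a = 6.0                      # explained instance has A = 6
bin_low, bin_high = 4.0, 6.0          # feature A's discretized class is [4, 6]
target = model(instance_a)

# current behaviour: whenever A is fixed, the exact value 6 is passed,
# so every perturbed sample agrees with the target label -> precision 1.0
print(np.mean([model(instance_a) == target for _ in range(1000)]))  # 1.0

# sampling anywhere within the discretized class reveals the boundary at 5
samples = rng.uniform(bin_low, bin_high, size=1000)
print(np.mean([model(a) == target for a in samples]))               # ~0.5
```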
just a first thought for this example: bad discretization...
the sequential definition of discretization and anchors is a weakness of the approach - in an ideal world, we would optimize bin borders and anchors simultaneously.
but to be more productive: I can change the perturbation, but how?
1.) Use the value of a random observation that falls into the same class: requires the training set? see issue #39 (and a few values could be reused often...)
2.) Use a random value within the class: requires the distribution within the class? (assume uniform?)
I'd like to see 1.) implemented. The advantage of the current perturbation approach is that only values that actually exist in the training set get passed to the model. If possible, we should keep it that way so that we do not depend on assumptions about the distribution.
However, as of now, two perturbation functions exist: one for the original values and one for the discretized values. Their results would have to be interconnected for approach 1), roughly as sketched below.
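A sketch only, assuming a hypothetical discretize(feature, value) helper that maps a value to its bin id and a 2-D train_data array - not the existing perturbation functions' API:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_same_bin(instance, train_data, fixed_idx, discretize):
    """Non-fixed features are taken from one random instance as before; each
    fixed feature is replaced by the value of a random training observation
    that falls into the same discretized class, so only values that actually
    occur in the training set reach the model."""
    perturbed = train_data[rng.integers(len(train_data))].copy()
    for i in fixed_idx:
        target_bin = discretize(i, instance[i])
        bins = np.array([discretize(i, v) for v in train_data[:, i]])
        perturbed[i] = rng.choice(train_data[:, i][bins == target_bin])
    return perturbed
```

The candidate values per feature and bin could of course be precomputed once instead of being filtered on every call.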
Re: "in an ideal world" - this is similar to what @NoItAll does with the "Magie" approach, right?
Which milestone are we targeting here?
alternative 1.) is good - the simpler one ;) I implemented it for review
it does not work without a training set though, which matters with issue #39 in mind (but I guess that has lower prio)
Re: "in an ideal world" - this is similar to what @NoItAll does with the "Magie" approach, right?
Hi,
Currently, MAGIE does not contain this functionality - I rather suggested it as an outlook on possible additions. Such ad-hoc discretization using a genetic algorithm was suggested in a paper I read on classification rule mining.
This would also imply creating a new index structure, e.g. a k-d-tree, that can deal with continuous data. The question is whether the k-d-tree would be adequately fast, as the major speed-up of the rule mining in MAGIE stemmed from using roaring bitmaps.
This issue has been fixed in the AutoTuning branch.