Tabular perturbation is inaccurate when using discretization
TobiasGoerke opened this issue · 10 comments
The default tabular perturbation function currently takes a random instance and replaces the perturbed instance's non-fixed feature values with the corresponding values of that other instance.
The fixed values remain unchanged.
This is inaccurate when using discretization: even the fixed values should randomly change within their discretized class.
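To illustrate, a minimal sketch of the current behaviour (made-up names, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_current(instance, data, fixed_idx):
    """Copy a randomly drawn instance and re-insert the exact original
    values for all fixed (anchored) features."""
    perturbed = data[rng.integers(len(data))].copy()
    perturbed[fixed_idx] = instance[fixed_idx]  # fixed values never change
    return perturbed
```

With discretization, a fixed feature is really only constrained to its bin, yet this always passes back the single original value, so the model never sees any other value from that bin.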
I have not fully thought this through, but yes: it appears that there is more precision to be gained here.
This issue can also cause problems rather than just inaccuracies. The following situation just occurred to me:
Assume there is a decision tree that splits feature A at value 5. The instance to be explained has a value of 6, and feature A is discretized into a class ranging from 4 to 6. Now, each time feature A is fixed, the value 6 is passed to the model. Hence, we'll get a high precision even though we are ignoring a decision boundary that lies inside the class.
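A quick toy version of that situation (the uniform sampling in the second block is only an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
model = lambda a: int(a > 5)          # toy model: decision boundary at A = 5
instance_a = 6.0                      # explained instance has A = 6
bin_low, bin_high = 4.0, 6.0          # feature A's discretized class is [4, 6]
target = model(instance_a)

# current behaviour: whenever A is fixed, the exact value 6 is passed,
# so every perturbed sample agrees with the target label -> precision 1.0
print(np.mean([model(instance_a) == target for _ in range(1000)]))  # 1.0

# sampling anywhere within the discretized class reveals the boundary at 5
samples = rng.uniform(bin_low, bin_high, size=1000)
print(np.mean([model(a) == target for a in samples]))               # ~0.5
```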
just a first thought for this example: bad discretization...
the sequential definition of discretization and anchors is a weakness of the approach - in an ideal world, we would optimize bin borders and anchors simultaneously.
but to be more productive: I can change the perturbation, but how?
1.) Use the value of a random observation that falls into the same class: requires the training set? see issue #39 (and a few values could be reused often...)
2.) Use a random value within the class: requires the distribution within the class? (assume uniform?)
I'd like to see 1.) implemented. The advantage of the current perturbation approach is that only values that actually exist in the training set get passed to the model. If possible, we should keep it that way so that we do not depend on assumptions about the distribution.
However, as of now, two perturbation functions exist: one for the original values and one for the discretized values. Their results would have to be interconnected for approach 1), roughly as sketched below.
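A sketch only, assuming a hypothetical discretize(feature, value) helper that maps a value to its bin id and a 2-D train_data array - not the existing perturbation functions' API:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_same_bin(instance, train_data, fixed_idx, discretize):
    """Non-fixed features are taken from one random instance as before; each
    fixed feature is replaced by the value of a random training observation
    that falls into the same discretized class, so only values that actually
    occur in the training set reach the model."""
    perturbed = train_data[rng.integers(len(train_data))].copy()
    for i in fixed_idx:
        target_bin = discretize(i, instance[i])
        bins = np.array([discretize(i, v) for v in train_data[:, i]])
        perturbed[i] = rng.choice(train_data[:, i][bins == target_bin])
    return perturbed
```

The candidate values per feature and bin could of course be precomputed once instead of being filtered on every call.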
Re: "in an ideal world" - this is similar to what @NoItAll does with the "Magie" approach, right?
Which milestone are we targeting here?
alternative 1.) is good - the simpler one ;) I implemented it for review
it does not work without a training set though, which matters with issue #39 in mind (but I guess that has lower prio)
Re: "in an ideal world" - this is similar to what @NoItAll does with the "Magie" approach, right?
Hi,
Currently, MAGIE does not contain this functionality - I rather suggested it as an outlook on possible additions. Such ad-hoc discretization using a genetic algorithm was suggested in a paper I read on classification rule mining.
This would also imply creating a new index structure, e.g. a k-d-tree, that can deal with continuous data. The question is whether the k-d-tree would be adequately fast, as the major speed-up of the rule mining in MAGIE stemmed from using roaring bitmaps.
This issue has been fixed in the AutoTuning branch.