ClearTK/cleartk

Division by zero in ZeroMeanUnitStddevExtractor

Closed this issue · 8 comments

Original issue 399 created by ClearTK on 2014-01-16T18:24:05.000Z:

If a particular feature occurred only once while training a ZeroMeanUnitStddevExtractor (or always occurred with the same value), stats.stddev will be 0, so (value - stats.mean) / stats.stddev will be NaN (or ±Infinity when the value differs from the mean), leading to problems down the line. I am not sure what the best solution would be here.
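
To make the failure mode concrete, here is a minimal sketch of the same arithmetic on plain Java doubles. It assumes the mean and standard deviation are stored as `double` values; the class and method names are hypothetical and not part of ClearTK.

```java
// Hypothetical demo class; not part of ClearTK.
class ZeroStddevDemo {
  // Same arithmetic as the expression quoted in the report, on plain doubles
  // (assumption: the stored statistics are doubles).
  static double zScore(double value, double mean, double stddev) {
    return (value - mean) / stddev;
  }

  public static void main(String[] args) {
    // A feature seen once with value 3.0 gives mean = 3.0 and stddev = 0.0.
    System.out.println(zScore(3.0, 3.0, 0.0)); // 0.0 / 0.0   -> NaN
    System.out.println(zScore(5.0, 3.0, 0.0)); // 2.0 / 0.0   -> Infinity
    System.out.println(zScore(1.0, 3.0, 0.0)); // -2.0 / 0.0  -> -Infinity
  }
}
```

No exception is thrown in any of the three cases, which is why the problem only surfaces later, for example when the feature values reach a classifier.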

Comment #1 originally posted by ClearTK on 2014-03-15T18:11:10.000Z:

This issue reminds me a bit of Issue-396. It might be useful to consult that fix for this one.

Comment #2 originally posted by ClearTK on 2014-04-12T18:02:48.000Z:

Alexey, I'm curious to know whether you have thought any more about how you would like this issue to be resolved. I am inclined to recommend that we modify org.cleartk.ml.feature.transform.extractor.ZeroMeanUnitStddevExtractor.train() so that it throws an exception if the stddev is 0 when it writes out the MeanVarianceRunningStat objects. A feature that occurs only once, or always has the same value, seems like a very poor choice of feature. I think it would be better to fail during training than to somehow try to make it work when classifying. What are your thoughts?
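
A rough sketch of the fail-at-training idea proposed above. `RunningStat` is a hypothetical stand-in accumulator (Welford's online update), not the ClearTK `MeanVarianceRunningStat` class, and `requireNonZeroStddev` is an assumed validation hook, not an existing method.

```java
import java.util.Map;

// Hypothetical stand-in for a running mean/variance accumulator; NOT the ClearTK class.
class RunningStat {
  private long n;
  private double mean;
  private double m2; // running sum of squared deviations from the mean

  // Welford's online update.
  void add(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }

  double mean() {
    return mean;
  }

  // Population standard deviation; exactly 0.0 for a single value or constant values.
  double stddev() {
    return n > 0 ? Math.sqrt(m2 / n) : 0.0;
  }
}

class TrainingCheck {
  // Hypothetical check to run before the per-feature statistics are written out.
  static void requireNonZeroStddev(Map<String, RunningStat> statsByFeature) {
    for (Map.Entry<String, RunningStat> entry : statsByFeature.entrySet()) {
      if (entry.getValue().stddev() == 0.0) {
        throw new IllegalStateException(
            "Feature '" + entry.getKey() + "' has zero standard deviation in the training data");
      }
    }
  }
}
```

The trade-off is that one rare feature aborts the whole training run, which matters for sparse features such as n-gram counts.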

Comment #3 originally posted by ClearTK on 2014-04-12T18:34:41.000Z:

The feature I ran into this with was 3-grams, which will quite often be unique in the training data. The proper thing to do would be to add a special feature for unique (or rare) n-grams, but is there a simple way to do this in ClearTK? I don't remember seeing one, but I am no longer using ClearTK actively.

Comment #4 originally posted by ClearTK on 2014-04-12T18:47:36.000Z:

I'm not understanding the use case. Are you saying that you were training a ZeroMeanUnitStddevExtractor for each unique 3-gram in your training data? For starters, that feature extractor is meant for numeric features.

I would think that a TF-IDF feature for bag-of-ngrams might be a good starting place.

Comment #5 originally posted by ClearTK on 2014-04-12T23:26:00.000Z:

Never mind my last comment. I just did a unit test on MaxMinNormalizationExtractor and now I understand why you would count 3-grams and submit them to the ZMUSE. I'll have to make a best guess as to the correct behavior when I write the unit test here.

Comment #6 originally posted by ClearTK on 2014-04-13T01:16:38.000Z:

Ok - I think that if a feature only occurs once or always has the same value, then it is reasonable to return zero when the feature value being transformed equals the mean. However, this is likely going to be a pretty worthless feature. If the feature value being transformed differs from the mean, then I would consider the result undefined. Even if we came up with a reasonable estimate/default value, it still isn't likely to be a useful feature. I think a reasonable thing to do when transforming such features is to return nothing, so the list of features returned from the extract method will simply be shorter.

Comment #7 originally posted by ClearTK on 2014-04-13T03:10:00.000Z:

If stddev = 0, or if a feature from the sub-extractor has never been seen before, then we will not create a zmus feature for it.
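
The behavior described here can be sketched roughly as follows. This illustrates the rule only and is not the actual ClearTK change: `ZmusSketch`, `MeanStddev`, and `transform` are hypothetical names, and features are modeled as a simple name-to-value map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the skip rule; not ClearTK code.
class ZmusSketch {
  // Hypothetical holder for the per-feature statistics learned during training.
  static final class MeanStddev {
    final double mean;
    final double stddev;

    MeanStddev(double mean, double stddev) {
      this.mean = mean;
      this.stddev = stddev;
    }
  }

  // Returns normalized values, omitting any feature that was never seen in
  // training or whose training stddev is 0, so the result may be shorter
  // than the input.
  static Map<String, Double> transform(
      Map<String, Double> rawFeatures, Map<String, MeanStddev> trained) {
    Map<String, Double> normalized = new LinkedHashMap<>();
    for (Map.Entry<String, Double> entry : rawFeatures.entrySet()) {
      MeanStddev stats = trained.get(entry.getKey());
      if (stats == null || stats.stddev == 0.0) {
        continue; // unseen feature or zero spread: emit no feature at all
      }
      normalized.put(entry.getKey(), (entry.getValue() - stats.mean) / stats.stddev);
    }
    return normalized;
  }
}
```

With this rule no NaN or infinite value can reach the classifier, at the cost of silently dropping the affected features.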

Comment #8 originally posted by ClearTK on 2014-04-13T07:06:34.000Z:

Yes, this seems very reasonable.