camspiers/statistical-classifier

Match threshold for Naive Bayes Classifier?

Closed this issue · 5 comments

I'm using the Naive Bayes Classifier for a data set, but I can't find any way to get the "certainty" value of a classification.

I want this so that data that doesn't belong to any category isn't assigned an arbitrary one just because too few documents have been learned.

For example, consider this data set:

$source->addDocument('pig', 'Pigs are great. Pink and cute!');
$source->addDocument('wolf', 'Wolves have teeth. They are gray.');

Now, we classify some junk data:

echo $c->classify('0943jf904jf09j34fpj'), PHP_EOL; // wolf??

This will return "wolf". But clearly, there is nothing in the classified string to justify that. So I would like the classifier to tell me somehow that the match for "wolf" is very weak (nonexistent in this case). That way I could discard the match and assign the document to an "uncategorized" category instead.
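To make the request concrete, here is a rough sketch of the decision rule I have in mind. It assumes a classifier could hand back a probability-like score per category; the `pickCategory` helper, the `$scores` array and the 0.8 threshold are all hypothetical and not part of the library.

```php
<?php
// Hypothetical helper, not part of camspiers/statistical-classifier.
// $scores maps category => probability-like score in [0, 1].
function pickCategory(array $scores, float $threshold, string $fallback = 'uncategorized'): string
{
    if ($scores === []) {
        return $fallback;            // nothing to choose from
    }
    arsort($scores);                 // highest score first, keys preserved
    $best = array_key_first($scores);

    // Accept the best category only if its score clears the threshold.
    return $scores[$best] >= $threshold ? (string) $best : $fallback;
}

echo pickCategory(['pig' => 0.5, 'wolf' => 0.5], 0.8), PHP_EOL;   // uncategorized
echo pickCategory(['pig' => 0.93, 'wolf' => 0.07], 0.8), PHP_EOL; // pig
```

The threshold itself would have to be tuned per data set; 0.8 is only a placeholder.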

I'm not familiar with the internals of the classifier, but there should be a certainty value in there somewhere to signify how strong a classification is.

Let me know if you need any more info.

For reference, there is some info on these "threshold" values in chapter 2.5 of this page:
https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html

Hey @khromov I will have a chance to think about this within the next day, but in the meantime, the place to start is the paper that the classification algorithm is derived from: Tackling the Poor Assumptions of Naive Bayes Text Classifiers.

I don't think we can achieve threshold-style classification for the Complement NB algorithm (the one currently implemented). But it is easy to write a less performant NB algorithm that can use a threshold, and my intention was to provide different classifiers for different purposes.
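As a rough illustration of what such a simpler NB could look like, here is a minimal sketch of a plain multinomial Naive Bayes that returns normalized posteriors, which a threshold could then be applied to. This is not the library's Complement NB and not its API; `tokenize`, `train` and `posteriors` are hypothetical names, and skipping tokens never seen in training is a design choice of this sketch.

```php
<?php
// Illustrative plain multinomial Naive Bayes; NOT the library's Complement NB.
function tokenize(string $text): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $m);
    return $m[0];
}

// $docs is a list of [category, text] pairs.
function train(array $docs): array
{
    $model = ['docs' => [], 'counts' => [], 'totals' => [], 'vocab' => []];
    foreach ($docs as [$cat, $text]) {
        $model['docs'][$cat] = ($model['docs'][$cat] ?? 0) + 1;
        foreach (tokenize($text) as $tok) {
            $model['counts'][$cat][$tok] = ($model['counts'][$cat][$tok] ?? 0) + 1;
            $model['totals'][$cat] = ($model['totals'][$cat] ?? 0) + 1;
            $model['vocab'][$tok] = true;
        }
    }
    return $model;
}

// Returns category => posterior probability (values sum to 1).
function posteriors(array $model, string $text): array
{
    $vocabSize = count($model['vocab']);
    $docTotal = array_sum($model['docs']);
    $tokens = tokenize($text);
    $log = [];
    foreach ($model['docs'] as $cat => $docCount) {
        $log[$cat] = log($docCount / $docTotal); // log prior
        foreach ($tokens as $tok) {
            if (!isset($model['vocab'][$tok])) {
                continue; // skip tokens never seen in training
            }
            $count = $model['counts'][$cat][$tok] ?? 0;
            // Laplace (add-one) smoothed log likelihood
            $log[$cat] += log(($count + 1) / (($model['totals'][$cat] ?? 0) + $vocabSize));
        }
    }
    // Normalize in log space to avoid underflow on long documents.
    $max = max($log);
    $sum = 0.0;
    foreach ($log as $l) {
        $sum += exp($l - $max);
    }
    $out = [];
    foreach ($log as $cat => $l) {
        $out[$cat] = exp($l - $max) / $sum;
    }
    return $out;
}

$model = train([
    ['pig', 'Pigs are great. Pink and cute!'],
    ['wolf', 'Wolves have teeth. They are gray.'],
]);
$junk = posteriors($model, '0943jf904jf09j34fpj'); // pig and wolf both 0.5
```

With the two training documents from the original post, the junk string contains no known tokens, so the posteriors fall back to the priors (0.5 each) and any threshold above 0.5 would reject the match.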

Also check out the SVM classifier.

@khromov I have added the threshold functionality to the SVM classifier. Still doesn't look like I will be able to add it to the complement NB algo.

@khromov I have introduced the ability to at least return false (from classify) when there is nothing about the document that should cause it to be classified in one category over another. This handles the case where the document matches two classes to the same degree, and also the case where no features of the document match the features of any class. It is unlikely I will be able to include threshold functionality in the CNB classifier, so with this addition I am closing this ticket for now.
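On the calling side, handling the false return could look roughly like the sketch below. `classifyOrFallback` is a hypothetical wrapper, and `$stub` is only a stand-in for `$c->classify(...)` so the snippet runs without the library.

```php
<?php
// Hypothetical wrapper. $classify is any callable that returns a category
// name, or false when no category stands out (the behaviour described above).
function classifyOrFallback(callable $classify, string $document, string $fallback = 'uncategorized'): string
{
    $result = $classify($document);

    // Strict comparison, so a category literally named "0" is not mistaken for false.
    return $result === false ? $fallback : (string) $result;
}

// Stand-in for the real classifier; purely illustrative.
$stub = function (string $doc) {
    return stripos($doc, 'pink') !== false ? 'pig' : false;
};

echo classifyOrFallback($stub, 'Pink and cute'), PHP_EOL;       // pig
echo classifyOrFallback($stub, '0943jf904jf09j34fpj'), PHP_EOL; // uncategorized
```

With the real classifier, the callable would presumably be `[$c, 'classify']`, where `$c` is the classifier instance from the original example.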

Oh, and the tag is 0.6.3

Thanks @camspiers, you went above and beyond to add this functionality. Much appreciated!