optimaize/language-detector

Cannot find where you divide ngram count from language file by n_words


Hello,

I was reading your sources because I am working on language detection for a small project of mine.
I came to your project after the shuyo/language-detection one.
I cannot find where in your sources you divide the integer count associated with each n-gram in the language files by the total number of n-grams of that size that were analyzed.

In the shuyo/language-detection project this is done in DetectorFactory.java, line 135.
If you don't do this division, all results will be biased toward the languages whose profiles contain the most words.

Did I miss something in your sources?

The stats are recalculated every time a language profile is built.
The division happens here:

double prob = frequency.doubleValue() / profile.getNumGramOccurrences(ngram.length());
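
For readers following along, here is a minimal, self-contained sketch of the same idea: each raw count is divided by the total number of occurrences of all n-grams of the same length. The class name NgramNormalizer and the method normalize are hypothetical and are not part of this library; the sketch also assumes that every count is positive.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the library's actual API.
public class NgramNormalizer {

    /** Converts raw n-gram counts into relative frequencies, computed per n-gram length. */
    public static Map<String, Double> normalize(Map<String, Integer> counts) {
        // Sum the occurrences separately for each n-gram length (1-grams, 2-grams, ...).
        Map<Integer, Long> totals = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            totals.merge(e.getKey().length(), e.getValue().longValue(), Long::sum);
        }

        // Divide each count by the total for its own length.
        // Assumption: counts are positive, so no total is zero.
        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            long total = totals.get(e.getKey().length());
            probs.put(e.getKey(), e.getValue().doubleValue() / total);
        }
        return probs;
    }
}
```

With this normalization, a profile built from a large corpus and one built from a small corpus end up on the same scale, since each value is a share of the total rather than an absolute count.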

Thank you for pointing it out to me.