Cannot find where you divide ngram count from language file by n_words
Closed this issue · 2 comments
BlazingJ commented
Hello,
I was reading your source code because I am working on language detection for a small project of mine.
I came to your project after the shuyo/language-detection one.
I cannot find where in your sources you divide the integer count associated with each n-gram in the language files by the total number of n-grams of that size analyzed for the language.
In the shuyo/language-detection project, this is done in DetectorFactory.java, line 135.
If you don't do this division, all results will be biased toward the languages whose profiles contain the most words.
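To make the concern concrete, here is a minimal sketch of the normalization I have in mind. It is not code from either project: the class name `NgramNormalizer`, the method `normalize`, and the parameter names are made up for illustration, and I am assuming that `nWords[k - 1]` holds the total count of n-grams of length k for the language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not taken from either project.
final class NgramNormalizer {

    /**
     * Turns raw n-gram counts into relative frequencies.
     *
     * @param freq   raw count per n-gram, as stored in a language file
     * @param nWords assumed layout: nWords[k - 1] is the total number of
     *               n-grams of length k seen when the profile was built
     */
    static Map<String, Double> normalize(Map<String, Integer> freq, int[] nWords) {
        Map<String, Double> prob = new HashMap<>();
        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            // Note: length() counts UTF-16 code units, which is enough for a sketch.
            int length = entry.getKey().length();
            if (length < 1 || length > nWords.length) {
                continue; // n-gram size not covered by the totals
            }
            int total = nWords[length - 1];
            if (total == 0) {
                continue; // avoid dividing by zero for empty size classes
            }
            // Dividing by the per-size total makes profiles of different
            // corpus sizes comparable.
            prob.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return prob;
    }
}
```

With this sketch, a unigram counted 50 times out of 200 and a bigram counted 10 times out of 40 both end up with a relative frequency of 0.25, regardless of how much text the profile was built from.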
Did I miss something in your sources?
djelinski commented
The stats are recalculated every time a language profile is built.
The division happens here:
BlazingJ commented
Thank you for pointing it out.