optimaize/language-detector

misdetection because of break at CONV_THRESHOLD

Opened this issue · 1 comment

#40 introduced this line in detectBlockShortText:

if (Util.normalizeProb(prob) > CONV_THRESHOLD) break;

However, I found a case in which a 100-character text that is clearly German is identified as Dutch. This does not happen when I comment out the break (while keeping the Util.normalizeProb(prob) call).

Code to reproduce:
https://gist.github.com/danielnaber/6f738fca065e87a5d067710aabaa1883
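Separately from the gist, here is a toy sketch of how an early break on a convergence threshold could lock in the wrong language. It is not the library's code: the class `EarlyBreakSketch`, the helper `normalize`, the threshold value 0.99, and all likelihood numbers are made up for illustration. It only assumes that the loop multiplies per-n-gram likelihoods into a probability vector, normalizes it, and stops once the largest entry exceeds the threshold.

```java
import java.util.Arrays;

class EarlyBreakSketch {
    // Hypothetical threshold for this sketch; not taken from the library.
    static final double CONV_THRESHOLD = 0.99;

    // Normalizes prob in place so it sums to 1 and returns the largest entry.
    static double normalize(double[] prob) {
        double sum = 0.0;
        for (double p : prob) sum += p;
        double max = 0.0;
        for (int i = 0; i < prob.length; i++) {
            prob[i] /= sum;
            if (prob[i] > max) max = prob[i];
        }
        return max;
    }

    // likelihoods[k][lang] is a made-up likelihood of the k-th n-gram under each language.
    // Returns the index of the language with the highest final probability.
    static int detect(double[][] likelihoods, boolean earlyBreak) {
        int langs = likelihoods[0].length;
        double[] prob = new double[langs];
        Arrays.fill(prob, 1.0 / langs);
        for (double[] gram : likelihoods) {
            for (int j = 0; j < langs; j++) prob[j] *= gram[j];
            double max = normalize(prob);
            // The break under discussion: stop as soon as one language looks certain.
            if (earlyBreak && max > CONV_THRESHOLD) break;
        }
        int best = 0;
        for (int j = 1; j < langs; j++) if (prob[j] > prob[best]) best = j;
        return best;
    }

    // Index 0 = "de", index 1 = "nl". The first n-gram strongly favors nl,
    // the following five favor de.
    static double[][] sampleLikelihoods() {
        return new double[][] {
            {0.001, 0.999},
            {0.9, 0.1}, {0.9, 0.1}, {0.9, 0.1}, {0.9, 0.1}, {0.9, 0.1}
        };
    }

    public static void main(String[] args) {
        String[] names = {"de", "nl"};
        System.out.println("with break:    " + names[detect(sampleLikelihoods(), true)]);
        System.out.println("without break: " + names[detect(sampleLikelihoods(), false)]);
    }
}
```

In this toy input, the loop with the break stops after the first n-gram and reports nl, while the loop without it sees the remaining evidence and reports de. Whether the real misdetection follows the same pattern would need to be checked against the gist.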

Just for the record: in my project, JUnit tests were failing because of zero probabilities (version 0.5).
When the same code ran as a regular run, all probabilities were fine.
This behaviour happened only under Windows (in my case Windows 10 x64, Oracle Java 8).
Under Linux (Oracle Java 8) it was fine in both cases (regular run and JUnit).

The problem was solved by upgrading to version 0.6, and it seems to be related to this break.