optimaize/language-detector

ShortText algorithm sometimes yields zero probabilities for all languages

detectBlockShortText does not break once CONV_THRESHOLD has been reached. Depending on the text length, this leads to zero probabilities for all languages.
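
One plausible mechanism for this, which the issue itself does not confirm, is floating-point underflow: if per-n-gram probabilities keep being multiplied in without normalization or an early exit, long enough input drives every language's score to exactly 0.0. The sketch below is not the library's code. The class name, the sample probabilities, and the value used for CONV_THRESHOLD are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Illustrative sketch only; not the language-detector implementation. */
class ShortTextUnderflowSketch {
    // Assumed value for illustration; the issue names the constant but not its value.
    static final double CONV_THRESHOLD = 0.99999;

    /** Hypothetical input: every n-gram is 100x more likely in language 0 than in language 1. */
    static double[][] sampleNgrams(int count) {
        double[][] ngrams = new double[count][];
        for (int i = 0; i < count; i++) {
            ngrams[i] = new double[] {1e-3, 1e-5};
        }
        return ngrams;
    }

    /** Starts every language with the same probability. */
    static double[] uniformPrior(int numLangs) {
        double[] prob = new double[numLangs];
        Arrays.fill(prob, 1.0 / numLangs);
        return prob;
    }

    /** Multiplies in every n-gram with no normalization and no early exit. */
    static double[] multiplyAll(double[][] ngrams) {
        double[] prob = uniformPrior(ngrams[0].length);
        for (double[] ngram : ngrams) {
            for (int lang = 0; lang < prob.length; lang++) {
                prob[lang] *= ngram[lang];
            }
        }
        return prob;
    }

    /** Normalizes after each n-gram and stops once one language exceeds CONV_THRESHOLD. */
    static double[] multiplyWithEarlyBreak(double[][] ngrams) {
        double[] prob = uniformPrior(ngrams[0].length);
        for (double[] ngram : ngrams) {
            double sum = 0.0;
            for (int lang = 0; lang < prob.length; lang++) {
                prob[lang] *= ngram[lang];
                sum += prob[lang];
            }
            if (sum == 0.0) {
                break; // nothing left to normalize
            }
            double max = 0.0;
            for (int lang = 0; lang < prob.length; lang++) {
                prob[lang] /= sum;
                max = Math.max(max, prob[lang]);
            }
            if (max > CONV_THRESHOLD) {
                break; // converged: further n-grams would not change the ranking
            }
        }
        return prob;
    }

    public static void main(String[] args) {
        double[][] ngrams = sampleNgrams(200);
        System.out.println(Arrays.toString(multiplyAll(ngrams)));
        System.out.println(Arrays.toString(multiplyWithEarlyBreak(ngrams)));
    }
}
```

With 200 sample n-grams the unnormalized product falls far below the smallest representable double, so both entries end up as 0.0, while the normalized variant stops after a few n-grams with language 0 close to 1. Whether detectBlockShortText fails for exactly this reason would need to be checked against the actual source.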

Example:

The Bulgarian sentence
Европа не трябва да стартира нов конкурентен маратон и изход с приватизация
yields a zero probability for all languages and, therefore, no result.

How to reproduce:

Add the following line to runTests in the DataLanguageDetectorImplTest unit test:

assertEquals(detector.getProbabilities(text("Европа не трябва да стартира нов конкурентен маратон и изход с приватизация")).get(0).getLocale().getLanguage(), "bg");

It seems to me that the break introduces a bug; please see #91.