ShortText algorithm sometimes yields zero probabilities for all languages

Question

Closed this issue 9 years ago · 1 comment

AlbertWeichselbraun commented 9 years ago

detectBlockShortText does not break out of its loop once CONV_THRESHOLD has been reached. Depending on the text size, this leads to zero probabilities for all languages.

Example:

The Bulgarian sentence
Европа не трябва да стартира нов конкурентен маратон и изход с приватизация
yields a zero probability for all languages and, therefore, no result.
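
To illustrate how a missing break can end in all-zero probabilities, here is a minimal, self-contained sketch. It is not the library's detectBlockShortText: the class ConvergenceSketch, the method detect, the likelihood values, and the CONV_THRESHOLD value are all hypothetical, and the update rule is simplified to a plain product of per-n-gram likelihoods. The assumption is that the zeros come from floating-point underflow when the loop keeps multiplying small values after the result has already converged.

```java
import java.util.Arrays;

// Hypothetical sketch, not the library's code: shows how multiplying small
// per-n-gram likelihoods without an early exit can underflow every language
// probability to 0.0.
final class ConvergenceSketch {

    // Assumed value for illustration; the real constant is not shown in this issue.
    static final double CONV_THRESHOLD = 0.99999;

    // Returns a copy of p scaled to sum to 1, or all zeros if everything underflowed.
    static double[] normalized(double[] p) {
        double sum = 0.0;
        for (double v : p) {
            sum += v;
        }
        double[] out = new double[p.length];
        if (sum == 0.0) {
            return out; // nothing left to normalize
        }
        for (int i = 0; i < p.length; i++) {
            out[i] = p[i] / sum;
        }
        return out;
    }

    static double max(double[] p) {
        double m = 0.0;
        for (double v : p) {
            m = Math.max(m, v);
        }
        return m;
    }

    // Simplification: every n-gram contributes the same likelihood vector.
    static double[] detect(double[] likelihood, int ngramCount, boolean breakOnConvergence) {
        double[] prob = new double[likelihood.length];
        Arrays.fill(prob, 1.0 / likelihood.length);
        for (int n = 0; n < ngramCount; n++) {
            for (int j = 0; j < prob.length; j++) {
                prob[j] *= likelihood[j];
            }
            // Early exit once one language clearly dominates.
            if (breakOnConvergence && max(normalized(prob)) > CONV_THRESHOLD) {
                break;
            }
        }
        return normalized(prob);
    }

    public static void main(String[] args) {
        double[] likelihood = {1e-3, 1e-5, 1e-5};
        System.out.println("with break:    " + Arrays.toString(detect(likelihood, 200, true)));
        System.out.println("without break: " + Arrays.toString(detect(likelihood, 200, false)));
    }
}
```

In this sketch the run with the early exit stops after a few n-grams with the first language dominating, while the run without it keeps multiplying until every entry is 0.0, so there is nothing left to rank.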

How to reproduce:

Add the following line to runTests in the DataLanguageDetectorImplTest unit test:

assertEquals(detector.getProbabilities(text("Европа не трябва да стартира нов конкурентен маратон и изход с приватизация")).get(0).getLocale().getLanguage(), "bg");

Answer 1 · 2018-08-10T08:09:20.000Z

It seems to me that the break introduces a bug; please see #91.