Wrong language detected with high confidence for nonsense text
Opened this issue · 4 comments
Hi,
I'm getting a "de" response with a score > 0.99 for text like the following:
6LSHOJDV 5LYR 8LERXSLQ 8UPDV 5DXGVHSS 0DULQH 6\VWHPV ,QVWLWXWH DW 7DOOLQQ 8QLYHUVLW\ RI 7HFKQRORJ\
Citing from the front page:
This software cannot handle it well when the input text is in none of the expected (and supported) languages. For example if you only load the language profiles from English and German, but the text is written in French, the program may pick the more likely one, or say it doesn't know. (An improvement would be to clearly detect that it's unlikely one of the supported languages.)
Thanks anyway for the submission. It's a good example to demonstrate a limit of the library.
I believe it should be detectable from the n-grams that the text above is not a good match. Another idea is to cross-check against real words; the text doesn't contain any.
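The cross-check idea could be sketched as a post-filter: count how many tokens of the input appear in a word list for the detected language, and reject the detection when the ratio is too low. This is a hypothetical sketch, not part of the library; the function name, the threshold, and the tiny word list are all made up for illustration.

```python
# Hypothetical post-filter (not part of the library): reject a detection
# when too few tokens appear in a word list for the detected language.
def looks_like_real_text(text, wordlist, min_hit_ratio=0.3):
    # Strip basic punctuation and lowercase before the lookup.
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in wordlist)
    return hits / len(tokens) >= min_hit_ratio

# Tiny illustrative word list; a real check would load a full dictionary
# (e.g. a Hunspell word list) for the detected language.
german = {"das", "ist", "ein", "haus", "und", "der", "die"}
looks_like_real_text("Das ist ein Haus", german)             # True
looks_like_real_text("6LSHOJDV 5LYR 8LERXSLQ 8UPDV", german) # False
```

A real implementation would need per-language dictionaries and some tolerance for inflected forms, but even a crude ratio would catch input like the sample above, where the hit ratio is zero.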
My doubt is about the high score, which is over 99%; I would expect a low one.
Thanks
+1. Today the probabilities of all detectable languages always add up to 100%, and the library doesn't even attempt to check if the text makes sense.
Would be interesting to see what's detected for a lower-case version of this text. German nouns are all capitalised, so it's plausible that German is the best match because there are decent scores for capital-letter unigrams (more so than for any other language, at least!). All-caps bigrams and trigrams are probably quite rare in the models.
@andrea-bologna, @djelinski, may I ask where you got texts like this from? It seems like strange input (though dictionary validation would be a nice feature).
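For what it's worth, the sample looks like text copied out of a PDF whose embedded subset font uses a custom encoding: every character code appears to be shifted down by 29. Shifting back recovers readable text (this decoding is my own observation, not something from the issue):

```python
# The sample from the issue; shifting each character code up by 29
# (spaces excluded) recovers plain text, which suggests a PDF
# copy/paste artifact rather than deliberately random input.
garbled = ("6LSHOJDV 5LYR 8LERXSLQ 8UPDV 5DXGVHSS "
           "0DULQH 6\\VWHPV ,QVWLWXWH DW 7DOOLQQ "
           "8QLYHUVLW\\ RI 7HFKQRORJ\\")
decoded = "".join(c if c == " " else chr(ord(c) + 29) for c in garbled)
print(decoded)
# → Sipelgas Rivo Uiboupin Urmas Raudsepp Marine Systems Institute
#   at Tallinn University of Technology
```

So the input is real-world: garbled PDF extraction output. That also supports the dictionary-validation idea, since such artifacts produce letter sequences that are valid n-grams but never valid words.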