hashwin/scylla

language detection issues


"i hate you".language # => "norwegian"
"i hate you so much".language # => "english"
"i love you".language # => "czech"
"kiss me".language # => "finnish"
"talk to me".language # => "italian"

@hashwin How would you suggest addressing these issues?

@Laykou @dom1nga This library is based on textcat, which detects a language from n-gram statistics rather than from any particular language's dictionary. Very short input gives it too few n-grams to work with, so the result is unreliable in those cases.
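
To make the n-gram point concrete, here is a rough sketch of how little signal a short phrase carries. It is a simplified illustration, not scylla's or textcat's actual code: the `char_ngrams` and `ngram_profile` names are hypothetical, and it only looks at character trigrams, padding word boundaries with `_` as an assumption.

```ruby
# Hypothetical helpers for illustration only; not part of scylla.

# Split text into overlapping character n-grams, marking word
# boundaries with "_" (an assumed convention for this sketch).
def char_ngrams(text, n = 3)
  padded = "_#{text.downcase.strip.gsub(/\s+/, '_')}_"
  padded.chars.each_cons(n).map(&:join)
end

# Rank the n-grams from most to least frequent (ties broken
# alphabetically). Needs Ruby 2.7+ for Enumerable#tally.
def ngram_profile(text, n = 3)
  char_ngrams(text, n).tally.sort_by { |gram, count| [-count, gram] }.map(&:first)
end
```

With this sketch, `char_ngrams("kiss me")` yields only seven trigrams, each occurring once, so there is very little to distinguish one language's profile from another's.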

My suggestion would be to trust the result only if the input text is at least 5 words long, or 10 to be on the safe side.
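
One way to apply that rule is a small guard around the detection call. This is a minimal sketch, not something scylla provides: the `language_if_long_enough` name and the `min_words` parameter are hypothetical, and the detection itself is passed in as a block so the helper makes no assumptions about the library beyond the `String#language` call shown at the top of this issue.

```ruby
# Hypothetical helper; not part of scylla.
# Returns nil when the input is too short to trust, otherwise
# returns whatever the given block (the detector) returns.
def language_if_long_enough(text, min_words: 5)
  # Count whitespace-separated words.
  return nil if text.split.size < min_words

  yield text
end
```

With scylla loaded, `language_if_long_enough("i hate you") { |t| t.language }` returns `nil` because the input has only three words, while a longer sentence is handed to `language` as usual. Passing `min_words: 10` gives the more conservative threshold.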