google/cld3

Some English words are detected as a different language


Please check the sheet below. Most of the simple English words are detected as a different language.
image

Most language detectors don't work well on very short texts (in this case a single word).
You could use the model's output scores to define a threshold under which no language is detected. Otherwise the language labels on short texts will probably be noisy.
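
For illustration, here is a minimal sketch of that thresholding idea. It assumes you can get (language, probability) pairs out of whatever detector you use; the function name `pick_language`, the 0.7 cutoff, and the sample scores are all made up for this example and are not part of cld3.

```python
from typing import Optional, Sequence, Tuple


def pick_language(
    scores: Sequence[Tuple[str, float]],
    min_prob: float = 0.7,  # illustrative cutoff; tune it on your own data
) -> Optional[str]:
    """Return the top-scoring language, or None if its score is below min_prob."""
    if not scores:
        return None
    lang, prob = max(scores, key=lambda pair: pair[1])
    return lang if prob >= min_prob else None


# Hypothetical scores, not real detector output.
print(pick_language([("ko", 0.41), ("en", 0.38)]))  # None
print(pick_language([("en", 0.97), ("de", 0.02)]))  # en
```

Returning `None` rather than a low-confidence label lets the caller decide what to do with short inputs, for example falling back to a default language.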

Why are language detectors so bad on short text? I get that the sample size is small, but one would think they would fall back to a basic sanity check. For example, the characters "age" have nothing in common with the characters used in Korean. This seems to be an issue with every language detection library we've used: pure randomness!
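
A rough sketch of what such a sanity check could look like, as a post-filter on top of any detector. It uses the first word of each character's Unicode name as a stand-in for its script, which is a heuristic rather than a real script lookup. The `EXPECTED_SCRIPT` table and both function names are hypothetical and deliberately tiny; nothing here comes from cld3.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical mapping from language code to the script we expect to see.
EXPECTED_SCRIPT = {"en": "LATIN", "ko": "HANGUL", "ru": "CYRILLIC", "el": "GREEK"}


def dominant_script(text: str) -> Optional[str]:
    """Guess the most common script among the letters in text."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "LATIN SMALL LETTER A" -> "LATIN"
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def is_plausible(lang: str, text: str) -> bool:
    """Return False only when the text's script contradicts the predicted language."""
    expected = EXPECTED_SCRIPT.get(lang)
    script = dominant_script(text)
    if expected is None or script is None:
        return True  # unknown language or no letters: nothing to veto
    return expected == script


print(is_plausible("ko", "age"))  # False
print(is_plausible("en", "age"))  # True
```

This only rejects labels whose script does not match the input. It cannot tell apart languages that share a script (English vs. French, say), so it complements a score threshold rather than replacing it.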

AmitMY commented

I feel like this one might be a little better - https://mediapipe-studio.webapps.google.com/demo/language_detector

Nice suggestion (detects 6/7 correctly)