Some English words are detected as a different language
Most language detectors don't work well on very short texts (in this case a single word).
You could use the model's output scores to define a threshold under which no language is detected. Otherwise the language labels on short texts will probably be noisy.
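A minimal sketch of that threshold idea, assuming the detector can give you a mapping from language label to a score between 0 and 1. The function name `pick_language`, the default threshold of 0.8, and the example scores are all made up for illustration; they are not part of any particular library.

```python
from typing import Mapping, Optional


def pick_language(scores: Mapping[str, float], threshold: float = 0.8) -> Optional[str]:
    """Return the top-scoring label, or None if its score is below the threshold."""
    if not scores:
        return None
    # Find the label with the highest score.
    label, score = max(scores.items(), key=lambda kv: kv[1])
    # Refuse to answer when the model is not confident enough.
    return label if score >= threshold else None


# Illustrative (invented) scores for a confident and an uncertain prediction.
confident = pick_language({"en": 0.97, "fr": 0.03})
uncertain = pick_language({"en": 0.41, "ko": 0.38, "de": 0.21})
```

Here `confident` is `"en"` and `uncertain` is `None`. A good threshold value depends on the model and on how short your inputs are, so it is worth tuning on a few labelled examples.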
Why are language detectors so bad on short text? I get that the sample size is small but one would think they would switch approaches to a basic sanity check. e.g., the characters "age" have absolutely no correlation with the characters found in Korean. This seems to be an issue with every language detection library we've used -- pure randomness!
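For what it's worth, the kind of sanity check I have in mind could be sketched like this, using only Python's standard `unicodedata` module. The helper names `contains_script` and `plausible_korean` are hypothetical, and this is only a rough filter: real Korean text can also contain Latin letters or Hanja, so it should only be used to reject a label, not to assign one.

```python
import unicodedata


def contains_script(text: str, script_prefix: str) -> bool:
    """True if any character's Unicode name starts with script_prefix (e.g. 'HANGUL')."""
    return any(unicodedata.name(ch, "").startswith(script_prefix) for ch in text)


def plausible_korean(text: str) -> bool:
    """Reject a Korean label when the text has no Hangul characters at all."""
    return contains_script(text, "HANGUL")


latin_word = plausible_korean("age")    # only Latin letters
hangul_word = plausible_korean("나이")  # Hangul syllables
```

With this check, `latin_word` is `False` and `hangul_word` is `True`, so a "Korean" prediction for "age" could be discarded before it reaches the user.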
I feel like this one might be a little better - https://mediapipe-studio.webapps.google.com/demo/language_detector
Nice suggestion (detects 6/7 correctly)