Some English words are detected as a different language

Question

Opened this issue 2 years ago · 4 comments

Please check the sheet below. Most of the simple English words in it are detected as a different language.

Answer 1 · 2023-02-16T23:56:41.000Z

Most language detectors don't work well on very short texts (in this case a single word).
You could use the model's output scores to define a threshold under which no language is detected. Otherwise the language labels on short texts will probably be noisy.
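
A minimal sketch of the thresholding idea, assuming the detector returns a mapping from language label to score. The function name `label_or_none`, the default threshold of 0.9, and the example scores are made up for illustration and are not taken from any particular library; a sensible threshold would have to be tuned on your own data.

```python
from typing import Mapping, Optional


def label_or_none(scores: Mapping[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the top-scoring label, or None if its score is below the threshold.

    `scores` is assumed to map language labels to confidence scores
    (hypothetical format; adapt it to whatever your detector returns).
    """
    if not scores:
        return None
    best_label = max(scores, key=scores.get)
    return best_label if scores[best_label] >= threshold else None


# Made-up scores for a short, ambiguous input: no label is confident enough.
print(label_or_none({"en": 0.41, "ko": 0.38, "nl": 0.21}))

# Made-up scores for a longer, clearly English input.
print(label_or_none({"en": 0.97, "ko": 0.02, "nl": 0.01}))
```

Returning `None` instead of a low-confidence label lets the caller treat short inputs as "unknown" rather than trusting a noisy guess.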

Answer 2 · 2023-06-19T11:24:38.000Z

Why are language detectors so bad on short text? I get that the sample size is small, but one would think they would fall back to a basic sanity check. For example, the characters "age" have absolutely no overlap with the characters found in Korean. This seems to be an issue with every language detection library we've used -- pure randomness!
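
One way such a sanity check could look, as a rough sketch rather than anything a detection library is known to do: compare the Unicode script of the input's letters with the script the predicted language is normally written in. The `EXPECTED_SCRIPT` table and both function names are hypothetical, and the script is approximated by the first word of each character's Unicode name.

```python
import unicodedata
from typing import Set

# Hypothetical, deliberately tiny table: the script each label is normally written in.
EXPECTED_SCRIPT = {"en": "LATIN", "ko": "HANGUL", "ru": "CYRILLIC"}


def scripts_in(text: str) -> Set[str]:
    """Collect the first word of the Unicode name of every letter in `text`."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def is_plausible(text: str, label: str) -> bool:
    """Reject a label whose usual script does not occur in the text at all."""
    expected = EXPECTED_SCRIPT.get(label)
    if expected is None:
        return True  # no information about this label, so do not veto it
    return expected in scripts_in(text)


print(is_plausible("age", "ko"))   # Latin letters only, so a Korean label is vetoed
print(is_plausible("age", "en"))
print(is_plausible("나이", "ko"))
```

This only rules out labels that are impossible given the script; it cannot separate languages that share an alphabet, such as English and Dutch.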

Answer 3 · 2023-07-07T10:53:52.000Z

I feel like this one might be a little better - https://mediapipe-studio.webapps.google.com/demo/language_detector

Answer 4 · 2023-07-07T11:49:17.000Z

Nice suggestion (detects 6/7 correctly)