optimaize/language-detector

Greek is identified as Catalan when no Greek model is loaded

Opened this issue · 5 comments

To speed up detection, I don't load all models, only the ones for the languages we use. While testing for mis-detections, I found that text like "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν περίπλοκα συστήματα επικοινωνίας , ιδιαίτερα την ανθρώπινη ικανότητα να το πράξουν , και μια γλώσσα είναι κάθε συγκεκριμένο παράδειγμα ενός τέτοιου συστήματος . Η επιστημονική μελέτη της γλώσσας ονομάζεται γλωσσολογία ." is identified as 99% Catalan, even though Catalan is a romanized language written in the A-Z alphabet.

I changed the phrase to Hebrew and it's still detected as Catalan. It looks like the detector just returns the first loaded model (ca) with 99% detection confidence...
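To illustrate why this happens, here is a toy sketch (not the library's actual algorithm, and all names are made up for illustration): if detection scores text by n-gram overlap with the loaded profiles, Greek input matches nothing in Latin-script profiles, every loaded language scores zero, and the "winner" is simply whichever model was loaded first.

```java
import java.util.*;

// Toy trigram-overlap "detector". With no Greek profile loaded, Greek
// trigrams match nothing, all scores are zero, and the tie-break keeps
// the first-inserted profile -- mirroring the "always ca" symptom.
public class Toy {
    static Set<String> trigrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) out.add(s.substring(i, i + 3));
        return out;
    }

    static String detect(LinkedHashMap<String, Set<String>> profiles, String text) {
        Set<String> textGrams = trigrams(text);
        String best = null;
        long bestHits = -1;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            long hits = e.getValue().stream().filter(textGrams::contains).count();
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); } // ties keep first
        }
        return best;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Set<String>> profiles = new LinkedHashMap<>();
        profiles.put("ca", trigrams("la llengua és la capacitat"));
        profiles.put("en", trigrams("language is the ability"));
        // Greek input has zero overlap with either Latin-script profile,
        // so the first profile ("ca") wins by default.
        System.out.println(detect(profiles, "Η γλώσσα είναι η ικανότητα")); // prints "ca"
    }
}
```

The real library's scoring is more sophisticated, but the failure mode is the same: with no in-script model loaded, there is nothing meaningful to compare against.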

Yes, it's a documented shortcoming of the current state of this library. Quoting the readme:

This software cannot handle it well when the input text is in none of the expected (and supported) languages. For example if you only load the language profiles from English and German, but the text is written in French, the program may pick the more likely one, or say it doesn't know. (An improvement would be to clearly detect that it's unlikely one of the supported languages.)

I ended up loading all models even though I don't need them all. This way I get a more or less predictable outcome, at least in the case described here. @fabiankessler do you want to keep this ticket open, or close it and track that improvement somewhere else?

The Lingua library loads models as needed based on detected scripts (an example implementation of a feature like this). However, loading all of Lingua's models could use a few gigabytes, whereas the Optimaize readme states:
Loading all 71 language profiles uses 74MB ram to store the data in memory
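The script-gating idea could be sketched roughly as follows. This is an assumption about the general approach, not Lingua's actual code, and the script-to-language mapping below is a made-up fragment: inspect the Unicode script of the input first, and only consider (or lazily load) profiles for languages written in that script.

```java
import java.util.*;

// Sketch: pick candidate languages by the dominant Unicode script of the
// input, so Greek text never competes against Latin-script profiles.
public class ScriptGate {
    // Hypothetical partial mapping from script to candidate language codes.
    static final Map<Character.UnicodeScript, List<String>> CANDIDATES = Map.of(
        Character.UnicodeScript.GREEK,  List.of("el"),
        Character.UnicodeScript.HEBREW, List.of("he"),
        Character.UnicodeScript.LATIN,  List.of("ca", "en", "de")
    );

    static List<String> candidates(String text) {
        // Count letters per script and pick the dominant one.
        Map<Character.UnicodeScript, Integer> counts = new HashMap<>();
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(e -> CANDIDATES.getOrDefault(e.getKey(), List.<String>of()))
            .orElse(List.of());
    }

    public static void main(String[] args) {
        System.out.println(candidates("Η γλώσσα είναι η ικανότητα")); // prints "[el]"
        System.out.println(candidates("la llengua és la capacitat")); // prints "[ca, en, de]"
    }
}
```

With a gate like this, the Greek input from this issue would either map to a Greek model or to an empty candidate list ("unknown"), instead of being forced onto Catalan.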

@edudar your primary goal in not loading all languages was to increase speed. I'd expect detection on text of that length to take on the order of a few milliseconds. Did you benchmark detection and find it to be slow when using all language models?
Did you also make sure the models weren't being reloaded into memory on each use?

To be honest, I don't recall all the details from that time, and I haven't worked on that project of mine for probably 3 years now... At a high level, it was a search application that did detection in real time, and with a p99 latency target of around 15-20ms, even a few milliseconds make a difference.