google/cld3

Did I hit a bug in gcld3?


I ran a few experiments to find out what detection returns for text that is not in the list of supported languages. It appears that whatever is detected for such text is marked as unreliable, so I can reject the detection. However, I stumbled upon an example where the result is unexpectedly bad:

```python
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
# English sentence, Bulgarian sentence, then pure gibberish
sample = "The last part of this text is pure gibberish with well crafted punctuation. Този текст е на Български. Sdslkmnscd scsun dc mcsaducsdnmlmc icmmklmdsc!"
result = detector.FindTopNMostFreqLangs(text=sample, num_langs=5)
# Print one line per detected language
for i in result:
    print(i.language, i.is_reliable, i.proportion, i.probability)
```

will surprisingly output this:
```
en True 0.4444444477558136 0.9999370574951172
bg True 0.28070175647735596 0.9173890948295593
hu True 0.27485379576683044 0.9084945917129517
und False 0.0 0.0
und False 0.0 0.0
```

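When the input is one part valid text and one part garbage, the result depends on which part comes first and which has the larger proportion, but it can still be interpreted correctly. The example above, however, is quite bad: the gibberish is apparently reported as Hungarian (`hu`) with `is_reliable` set to `True` and a probability of about 0.91.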
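
To make the problem concrete, here is a minimal sketch of the kind of rejection filter I have in mind, applied to the values printed above. The `Result` tuple and the `reliable_languages` helper are hypothetical stand-ins written for this report and are not part of gcld3. The sketch only assumes that each result exposes the four attributes printed in the loop.

```python
from collections import namedtuple

# Hypothetical stand-in for a gcld3 result object (assumption: only the
# four attributes printed in the report are needed).
Result = namedtuple("Result", "language is_reliable proportion probability")


def reliable_languages(results, min_probability=0.0):
    """Return language codes that pass the reliability checks.

    Drops 'und' placeholders, results with is_reliable == False, and
    results whose probability is below min_probability.
    """
    return [
        r.language
        for r in results
        if r.language != "und"
        and r.is_reliable
        and r.probability >= min_probability
    ]


# Values copied from the output in this report.
observed = [
    Result("en", True, 0.4444444477558136, 0.9999370574951172),
    Result("bg", True, 0.28070175647735596, 0.9173890948295593),
    Result("hu", True, 0.27485379576683044, 0.9084945917129517),
    Result("und", False, 0.0, 0.0),
    Result("und", False, 0.0, 0.0),
]

# Filtering on is_reliable alone keeps the spurious 'hu' entry.
print(reliable_languages(observed))

# A stricter probability cutoff removes 'hu', but also the genuine 'bg'.
print(reliable_languages(observed, min_probability=0.95))
```

With these numbers, the `is_reliable` flag alone does not filter out `hu`, and because the `bg` and `hu` probabilities are so close (about 0.917 versus 0.908), any probability cutoff that drops `hu` also drops the genuine Bulgarian sentence.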