optimaize/language-detector

Chinese detection is unusable


I noticed that even the slightest inclusion of English terms (like brand names) results in absolute nonsense as output:

Language: [it] 
Score: [0.15132266] 
Text: [小猫终于换发型! Ariana Grande为新砖染白金长发 爱莉安娜为新专辑变金发
Language: [nl] 
Score: [0.2858114] 
Text: [东京风尚 EP36 电子音乐节Audio2015 高颜值军团潮搭 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期来到日本最大型的音乐节!来看看日本潮人参加音乐节都是穿什么的哦!潮人搭配:上衣Gamber;裤子Galson;鞋子Reebok;背包Nike。
Language: [nl] 
Score: [0.52030957] 
Text: [东京风尚下北泽篇 EP28 优雅时尚的精品男适用街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。今天主持人洋子来到了有名的东京代官山,男生的潮人们可以参考一下!这里必逛:TSUTAVA BOOKS。采访潮人搭配:T恤Bobson;衬衫Rage Blue;裤子Diesel;帽子CA4LA;鞋子Nike Jordan。
Language: [zh-CN] 
Score: [0.5714257] 
Text: [光泽魅惑色盘 打造中性秋季妆容 用PONY新改良光泽魅惑色盘打造中性秋季妆容]
Language: [es] 
Score: [0.285714] 
Text: [东京风尚银座篇 EP42 优雅而时尚的精品店街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期美女主持阳子带我们来到银座的一条街道,这里聚集了潮牌的精品店,所以各位喜欢潮牌的男士一定不能错过!本期采访潮男搭配:Lad Musician上衣;WEGO裤子;Dr.Martens鞋子。

What makes it worse is that I don't know beforehand that the text is in Chinese, so I can't decide to strip the Latin characters first. That's exactly what I want the language detector to tell me, so I can act on its result. That's not possible at the moment.

@edudar ,
If you just want to detect Chinese, you could use the Java Unicode character blocks, along with the language detector, since many Chinese characters are used in Japanese and Korean as well.
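A minimal sketch of that idea, assuming a coarse "is this mostly CJK?" check is enough as a first pass (the helper name and the majority threshold are made up for illustration):

    import java.lang.Character.UnicodeBlock;

    // Counts how many letters fall into the CJK Unified Ideographs blocks.
    // This alone cannot tell Chinese from Japanese kanji or Korean hanja,
    // so you would still run the language detector afterwards.
    static boolean isMostlyCjk(String text) {
        int cjk = 0, letters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) continue; // skip digits, punctuation, spaces
            letters++;
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters; // majority of letters are CJK ideographs
    }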

Regards,
Supriti

Is this planned to be fixed? Or has it already been fixed?

I thought this might be related to #86 , so here are my results with that Java snippet:

小猫终于换发型! Ariana Grande为新砖染白金长发 爱莉安娜为新专辑变金发

detectedLanguages = {ArrayList@1356}  size = 1
 0 = {DetectedLanguage@1366} "DetectedLanguage[zh-CN:0.9999999971838761]"
detectedLanguagesNormalised = {ArrayList@1357}  size = 1
 0 = {DetectedLanguage@1369} "DetectedLanguage[zh-CN:0.999999999943355]"
  locale = {LdLocale@1371} "zh-CN"
  probability = 0.999999999943355

It didn't misdetect this one, so maybe the models have changed since the original post?


[东京风尚 EP36 电子音乐节Audio2015 高颜值军团潮搭 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期来到日本最大型的音乐节!来看看日本潮人参加音乐节都是穿什么的哦!潮人搭配:上衣Gamber;裤子Galson;鞋子Reebok;背包Nike。

detectedLanguages = {ArrayList@1360}  size = 2
 0 = {DetectedLanguage@1370} "DetectedLanguage[en:0.14302056598227653]"
 1 = {DetectedLanguage@1371} "DetectedLanguage[zh-CN:0.1428568882090279]"
detectedLanguagesNormalised = {ArrayList@1361}  size = 3
 0 = {DetectedLanguage@1375} "DetectedLanguage[eu:0.5714251799067207]"
 1 = {DetectedLanguage@1376} "DetectedLanguage[vi:0.14285710870888324]"
 2 = {DetectedLanguage@1377} "DetectedLanguage[zh-CN:0.14285705001212348]"

getting nonsense


[东京风尚下北泽篇 EP28 优雅时尚的精品男适用街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。今天主持人洋子来到了有名的东京代官山,男生的潮人们可以参考一下!这里必逛:TSUTAVA BOOKS。采访潮人搭配:T恤Bobson;衬衫Rage Blue;裤子Diesel;帽子CA4LA;鞋子Nike Jordan。

detectedLanguages = {ArrayList@1359}  size = 2
 0 = {DetectedLanguage@1369} "DetectedLanguage[de:0.4285712871573067]"
 1 = {DetectedLanguage@1370} "DetectedLanguage[zh-CN:0.14285676980231643]"
detectedLanguagesNormalised = {ArrayList@1360}  size = 3
 0 = {DetectedLanguage@1374} "DetectedLanguage[vi:0.5714266107314225]"
 1 = {DetectedLanguage@1375} "DetectedLanguage[no:0.14285987254624485]"
 2 = {DetectedLanguage@1376} "DetectedLanguage[zh-CN:0.14285635710576566]"

nonsense


光泽魅惑色盘 打造中性秋季妆容 用PONY新改良光泽魅惑色盘打造中性秋季妆容

detectedLanguages = {ArrayList@1356}  size = 1
 0 = {DetectedLanguage@1369} "DetectedLanguage[zh-CN:0.996796417137037]"
detectedLanguagesNormalised = {ArrayList@1357}  size = 1
 0 = {DetectedLanguage@1366} "DetectedLanguage[zh-CN:0.9999843647848928]"

东京风尚银座篇 EP42 优雅而时尚的精品店街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期美女主持阳子带我们来到银座的一条街道,这里聚集了潮牌的精品店,所以各位喜欢潮牌的男士一定不能错过!本期采访潮男搭配:Lad Musician上衣;WEGO裤子;Dr.Martens鞋子

detectedLanguages = {ArrayList@1359}  size = 1
 0 = {DetectedLanguage@1369} "DetectedLanguage[zh-CN:0.8571418737709192]"
detectedLanguagesNormalised = {ArrayList@1360}  size = 2
 0 = {DetectedLanguage@1372} "DetectedLanguage[zh-CN:0.5714290624145583]"
 1 = {DetectedLanguage@1373} "DetectedLanguage[vi:0.4285689612751914]"

It looks like for #86, Japanese detection requires normalising the input with the same normalization rules used during language-profile creation. However, from these examples, that normalization can obviously have quite a detrimental effect on other detections...

Will try to get more insight by debugging.
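For reference, a harness roughly like this would produce the two lists above (a sketch against the library's public API, not the exact snippet from #86; the class name is made up, and using CommonTextObjectFactories.forDetectingOnLargeText() as the "normalised" path is my assumption):

    import com.optimaize.langdetect.DetectedLanguage;
    import com.optimaize.langdetect.LanguageDetector;
    import com.optimaize.langdetect.LanguageDetectorBuilder;
    import com.optimaize.langdetect.ngram.NgramExtractors;
    import com.optimaize.langdetect.profiles.LanguageProfile;
    import com.optimaize.langdetect.profiles.LanguageProfileReader;
    import com.optimaize.langdetect.text.CommonTextObjectFactories;
    import com.optimaize.langdetect.text.TextObjectFactory;
    import java.util.List;

    public class DetectTwice {
        public static void main(String[] args) throws Exception {
            List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();
            LanguageDetector detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                    .withProfiles(profiles)
                    .build();
            String text = "小猫终于换发型! Ariana Grande为新砖染白金长发 爱莉安娜为新专辑变金发";
            // detection on the string as-is
            List<DetectedLanguage> detectedLanguages = detector.getProbabilities(text);
            // detection after the library's own text cleanup (TextObject pipeline)
            TextObjectFactory factory = CommonTextObjectFactories.forDetectingOnLargeText();
            List<DetectedLanguage> detectedLanguagesNormalised =
                    detector.getProbabilities(factory.forText(text));
            System.out.println(detectedLanguages);
            System.out.println(detectedLanguagesNormalised);
        }
    }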

For all of these cases, with and without normalization, I'm seeing >99% confidence for zh-CN when forcing short-text language detection:

    import com.optimaize.langdetect.*;
    import com.optimaize.langdetect.ngram.NgramExtractors;
    import com.optimaize.langdetect.profiles.*;
    import java.util.List;

    List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
    LanguageDetector detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
            .withProfiles(languageProfiles)
            .shortTextAlgorithm(2000) // strings shorter than this use the short-text algorithm
            .build();
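Querying that detector then looks like this (the text here is the EP36 sample from above, truncated):

    // with shortTextAlgorithm(2000), this reports zh-CN at >0.99 for these samples
    List<DetectedLanguage> probabilities =
            detector.getProbabilities("东京风尚 EP36 电子音乐节Audio2015 高颜值军团潮搭 ...");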

Note how the mixed examples were >50 chars, and so were classified as long text with the default detector settings.

For the 153-char string [东京风尚银座篇 EP42 优雅而时尚的精品店街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期美女主持阳子带我们来到银座的一条街道,这里聚集了潮牌的精品店,所以各位喜欢潮牌的男士一定不能错过!本期采访潮男搭配:Lad Musician上衣;WEGO裤子;Dr.Martens鞋子。]:

  • detection on the original string took 12ms

  • detection on the normalized string took 1ms (maybe something had already been loaded into the detector, so it was a bit snappier)

I can't comment on how much slower this makes things for genuinely long strings. Maybe it's best to break long input into samples of short strings and analyse those?
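Something like this, perhaps (just a sketch - the chunk size, the helper name, and the majority vote are arbitrary choices of mine, not anything the library provides):

    import com.google.common.base.Optional;
    import com.optimaize.langdetect.LanguageDetector;
    import com.optimaize.langdetect.i18n.LdLocale;
    import java.util.*;

    // Split the input into ~200-char chunks, run detection on each,
    // and take the most frequent answer as the overall language.
    static Optional<LdLocale> detectBySampling(LanguageDetector detector, String text) {
        Map<LdLocale, Integer> votes = new HashMap<>();
        for (int i = 0; i < text.length(); i += 200) {
            String chunk = text.substring(i, Math.min(i + 200, text.length()));
            Optional<LdLocale> lang = detector.detect(chunk); // Guava Optional
            if (lang.isPresent()) votes.merge(lang.get(), 1, Integer::sum);
        }
        return votes.isEmpty()
                ? Optional.<LdLocale>absent()
                : Optional.of(Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey());
    }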

Maybe the limit for short/long text detection is just too low (should it be, say, ten times larger to become accurate enough?), and that's why we see these poor results. The documentation could be better - I'm not sure what's going on in places, or what some of the variables are.

I also can't comment on the wider effect on accuracy. It's definitely worth testing further. For these cases the detection certainly looks much better.