optimaize/language-detector

Mixed language strange results (one is clearly more dominant).


Example text:

设为首页收藏本站 开启辅助访问 为首页收藏本站 开启辅助访为首页收藏本站 开启辅助访切换到窄版 请 登录 后使用快捷导航 没有帐号 注册 用户名 Email 自动登录  找回密码 密码 登录  注册 快捷导航 论坛BBS 导读Guide 排行榜Ranklist 淘帖Collection 日志Blog 相册Album 分享Share 搜索 搜索 帖子 用户 公告

Chinese is clearly more dominant than English, but the detector cannot detect Chinese at all, and "getProbabilities()" returns this list:

[DetectedLanguage[en:0.8571376011773154], DetectedLanguage[fr:0.14286031717254952]]

French? I have no idea where it sees French.

If I remove the end (the part with those English words), it detects Chinese fine.

I don't think that a few English words in such a dominantly Chinese text should produce such a wrong result.

Yes, this is not good; thanks for the report.
You can work around it by changing the defaults.

Visually, the text looks as if it had far more Han than Latin characters: those blocks are so dominant compared to the small Latin letters with few strokes. However, the counts are not so clear-cut:

Han = 105
Latin = 45 (!)
(plus 31 spaces, which are script agnostic)
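For anyone who wants to reproduce such per-script counts, this can be done with plain JDK tooling, no library needed. A quick sketch using `Character.UnicodeScript` (the class name `ScriptCount` is just for this example; spaces and punctuation fall under the script-agnostic `COMMON` script):

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

public class ScriptCount {

    // Tally code points by their Unicode script (HAN, LATIN, COMMON, ...).
    static Map<UnicodeScript, Integer> countScripts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints().forEach(cp ->
                counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        Map<UnicodeScript, Integer> c = countScripts("设为首页 论坛BBS 导读Guide");
        System.out.println("HAN   = " + c.getOrDefault(UnicodeScript.HAN, 0));
        System.out.println("LATIN = " + c.getOrDefault(UnicodeScript.LATIN, 0));
    }
}
```

Running this over the full example text should reproduce the 105/45 split mentioned above.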

The project's front page explains in the section "How to Use" that text should be run through the text object factory:

TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
TextObject textObject = textObjectFactory.forText("my text");

However, because the text in this example has such a high proportion of secondary-script content, the Latin part is not removed. The removal is performed by the RemoveMinorityScriptsTextFilter; this example needs a threshold >= 0.43, and the default of 0.3 is not enough.

Something like this will do for you:

new TextObjectFactoryBuilder()
                .withTextFilter(UrlTextFilter.getInstance())
                .withTextFilter(RemoveMinorityScriptsTextFilter.forThreshold(0.5))
                .build();
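For context, here is a sketch of how such a custom factory would plug into the usual detection pipeline. This follows the README's "How to Use" section; treat it as untested sample code against the standard optimaize API, with the sample text shortened:

```java
import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.*;

import java.io.IOException;
import java.util.List;

public class Example {
    public static void main(String[] args) throws IOException {
        // Build the detector from the bundled language profiles.
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();
        LanguageDetector detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        // Custom factory: raise the minority-script threshold from the
        // default 0.3 to 0.5 so the Latin minority gets stripped.
        TextObjectFactory factory = new TextObjectFactoryBuilder()
                .withTextFilter(UrlTextFilter.getInstance())
                .withTextFilter(RemoveMinorityScriptsTextFilter.forThreshold(0.5))
                .build();

        TextObject text = factory.forText("设为首页收藏本站 开启辅助访问 论坛BBS 导读Guide");
        Optional<LdLocale> lang = detector.detect(text);
        System.out.println(lang);
        System.out.println(detector.getProbabilities(text));
    }
}
```

With the Latin content filtered out before detection, the detector only ever sees the Han text.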

While this may solve your problem for the moment, I currently fail to see why it does not work out of the box, i.e. why it considers English more dominant. That needs debugging, so thanks for reporting it.

For what it's worth, with current git master the example returns the list:
[DetectedLanguage[vi:0.8571398047407237], DetectedLanguage[zh:0.14285646388985637]]

The Vietnamese profile contains a mix of Latin and Han, which apparently fits the provided text better than the Chinese profile does. I believe the profiles need a cleanup.

Interesting to see the secondary-script filtering, @fabiankessler.
In #63 I found that forcing short-text language detection also helped a lot for similar >50-character Chinese/English mixes. However, I'm not sure exactly what the implications of forcing short-text detection are.
At this point I'm not even sure whether the short-text and long-text algorithms use the same language models; at the very least, the algorithms differ quite a lot.