optimaize/language-detector

TextObjectFactory changes text

danielnaber opened this issue · 2 comments

This test fails:

    TextObjectFactory textObjectFactory = new TextObjectFactoryBuilder().maxTextLength(1000).build();
    String inp = "一体日本人は生きるということを知っているだろうか。";
    String shortText = textObjectFactory.forText(inp).toString();
    assertEquals(inp, shortText);

Output:

    org.junit.ComparisonFailure:
    Expected :一体日本人は生きるということを知っているだろうか。
    Actual   :一万日三人あ三ああああああああ之ああああああああ。

I guess the issue is in TextObject.append().

I don't understand why com.optimaize.langdetect.cybozu.util.CharNormalizer#normalize does this:

        } else if (block == Character.UnicodeBlock.HIRAGANA) {
            ch = '\u3042';

i.e. why are all characters of a Unicode block mapped to a single character?

Hi @danielnaber, please take a look at #86 (comment)

Basically: I think it's because hiragana and katakana are unique to Japanese (and, similarly, Hangul is unique to Korean, etc.), so collapsing each block to one character is an attempt to compress the models. I expect the "compressed" models to perform about as well as full models, but that is a guess that's not backed by data!
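To make the effect concrete, here is a minimal standalone sketch of the block-collapsing idea. It is not the library's code: the class and method names are hypothetical, and it reproduces only the HIRAGANA branch quoted above, so kanji and punctuation pass through unchanged (the real normalizer evidently rewrites some kanji too, as the test output shows).

```java
// Hypothetical illustration, not part of language-detector.
class HiraganaCollapseSketch {

    // Replaces every character in the HIRAGANA Unicode block with U+3042 ('あ'),
    // mirroring the single branch of CharNormalizer#normalize quoted in this issue.
    static String collapseHiragana(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.HIRAGANA) {
                ch = '\u3042';
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every hiragana character collapses to the same symbol.
        System.out.println(collapseHiragana("ということを"));
    }
}
```

With a mapping like this, an n-gram model only needs entries for "some hiragana" rather than for each individual kana, which is presumably where the size saving comes from.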

It seems that the action to take is to manually normalise your input text for detection (so that the munged text actually finds matches in the big n-gram map).