WorksApplications/Sudachi

Emoji after "。" causes StringIndexOutOfBoundsException

ciffelia opened this issue · 1 comment

Description

When analyzing text that contains an emoji after "。", a java.lang.StringIndexOutOfBoundsException is thrown.

Environment

  • Sudachi 0.4.2
  • system_core.dic @ 20200330
  • OpenJDK 8u252

Steps to reproduce

$ echo "。😀" | java -jar ./sudachi-0.4.2.jar
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
        at java.lang.String.substring(String.java:1963)
        at com.worksap.nlp.sudachi.UTF8InputText.getSubstring(UTF8InputText.java:82)
        at com.worksap.nlp.sudachi.MeCabOovProviderPlugin.provideOOV(MeCabOovProviderPlugin.java:101)
        at com.worksap.nlp.sudachi.OovProviderPlugin.getOOV(OovProviderPlugin.java:75)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.buildLattice(JapaneseTokenizer.java:220)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentence(JapaneseTokenizer.java:162)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentences(JapaneseTokenizer.java:93)
        at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:66)
        at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:218)
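
For background, here is a minimal Java sketch of why an emoji can upset index arithmetic in general. It is not a diagnosis of the Sudachi code: the class name EmojiOffsetDemo and the helper substringByCodePoints are hypothetical and exist only for this illustration. The input "。😀" has three different lengths depending on the unit counted (UTF-8 bytes, UTF-16 chars, Unicode code points), and an offset computed in one unit can fall out of range when passed to String.substring, which expects UTF-16 char indices.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Sudachi.
final class EmojiOffsetDemo {

    // Length of the text when encoded as UTF-8, in bytes.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length of the text in Unicode code points.
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    // Substring addressed by code point offsets instead of char offsets,
    // so a surrogate pair is never split and the indices stay in range.
    static String substringByCodePoints(String text, int begin, int end) {
        int charBegin = text.offsetByCodePoints(0, begin);
        int charEnd = text.offsetByCodePoints(charBegin, end - begin);
        return text.substring(charBegin, charEnd);
    }

    public static void main(String[] args) {
        // U+3002 (ideographic full stop) followed by U+1F600 (emoji),
        // written with escapes so the source file encoding does not matter.
        String text = "\u3002\uD83D\uDE00";

        System.out.println("UTF-16 chars: " + text.length());          // 3
        System.out.println("code points:  " + codePointLength(text));  // 2
        System.out.println("UTF-8 bytes:  " + utf8Length(text));       // 7

        // The emoji alone is 2 chars long, so a char index of 3 is invalid.
        String emoji = "\uD83D\uDE00";
        try {
            emoji.substring(0, 3);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("substring failed: " + e.getMessage());
        }

        // Addressing by code points returns the whole emoji.
        System.out.println(substringByCodePoints(text, 1, 2).equals(emoji)); // true
    }
}
```

Whether the exception above comes from this kind of unit mismatch would have to be confirmed against UTF8InputText.getSubstring and MeCabOovProviderPlugin.provideOOV.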

I reproduced it. I'll look into it.