Emoji after "ใ" causes StringIndexOutOfBoundsException
ciffelia opened this issue · 1 comment
ciffelia commented
Description
Analyzing text that contains an emoji immediately after a "ใ" throws
java.lang.StringIndexOutOfBoundsException.
Environment
- Sudachi 0.4.2
- system_core.dic @ 20200330
- OpenJDK 8u252
Steps to reproduce
$ echo "ใ๐" | java -jar ./sudachi-0.4.2.jar
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
    at java.lang.String.substring(String.java:1963)
    at com.worksap.nlp.sudachi.UTF8InputText.getSubstring(UTF8InputText.java:82)
    at com.worksap.nlp.sudachi.MeCabOovProviderPlugin.provideOOV(MeCabOovProviderPlugin.java:101)
    at com.worksap.nlp.sudachi.OovProviderPlugin.getOOV(OovProviderPlugin.java:75)
    at com.worksap.nlp.sudachi.JapaneseTokenizer.buildLattice(JapaneseTokenizer.java:220)
    at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentence(JapaneseTokenizer.java:162)
    at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentences(JapaneseTokenizer.java:93)
    at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:66)
    at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:218)
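For context, here is a minimal sketch of how this class of exception can arise in plain Java, using only the standard library. It is not Sudachi's code and I have not confirmed that this is the actual cause. The class name OffsetMismatchDemo and its helper methods are hypothetical, and the input ("a" followed by U+1F600) is a stand-in, not the exact string from the report. The point is that an emoji outside the Basic Multilingual Plane takes two UTF-16 code units and four UTF-8 bytes, so an offset computed in one unit and passed to String.substring, which indexes by UTF-16 code units, can run past the end of the string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Sudachi.
public class OffsetMismatchDemo {

    // Length in UTF-16 code units. String.substring indexes in this unit.
    static int utf16Length(String s) {
        return s.length();
    }

    // Length in bytes once the string is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in Unicode code points (one per emoji, even outside the BMP).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns true if s.substring(begin, end) throws
    // StringIndexOutOfBoundsException for the given offsets.
    static boolean substringThrows(String s, int begin, int end) {
        try {
            s.substring(begin, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Stand-in input (assumption): "a" followed by U+1F600,
        // written as a surrogate pair.
        String text = "a\uD83D\uDE00";

        System.out.println("UTF-16 code units: " + utf16Length(text));   // 3
        System.out.println("UTF-8 bytes:       " + utf8Length(text));    // 5
        System.out.println("Code points:       " + codePointLength(text)); // 2

        // Using the UTF-8 byte length as a substring end index overruns the string.
        System.out.println("substring(0, byteLength) throws: "
                + substringThrows(text, 0, utf8Length(text)));

        // Using the UTF-16 length is always in range.
        System.out.println("substring(0, length()) throws:   "
                + substringThrows(text, 0, utf16Length(text)));
    }
}
```

If something like this is happening, the fix would be to convert between byte offsets and code-unit offsets consistently before slicing, but that is a guess until someone checks the code around UTF8InputText.getSubstring.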
kazuma-t commented
I reproduced it. I'll look into it.