Surrogate pair not properly handled in SudachiSplitFilter
sorami opened this issue · 0 comments
sorami commented
In SudachiSplitFilter, an OOV token also gets per-character output in extended mode.
However, the "characters" are handled as a char array, which causes a problem when the token contains surrogate pairs.
For example, when the input text is "๐", there will be 3 tokens:
"๐"
String.valueOf("๐".toCharArray()[0])
String.valueOf("๐".toCharArray()[1])
(Possibly a similar problem exists outside this filter too?)
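The difference between splitting by char and splitting by code point can be reproduced outside the filter. The following is a minimal standalone sketch, not the filter's actual code: the class `CodePointSplit` and its two methods are hypothetical names, and U+1F600 is an assumed stand-in for the supplementary character in the example above. It only uses standard `String` and `Character` methods.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointSplit {
    // Splits by UTF-16 code unit (char). A supplementary character is
    // broken into two unpaired surrogates, as described in this issue.
    public static List<String> splitByChar(String s) {
        List<String> out = new ArrayList<>();
        for (char c : s.toCharArray()) {
            out.add(String.valueOf(c));
        }
        return out;
    }

    // Splits by Unicode code point, so a surrogate pair stays together.
    public static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            int length = Character.charCount(codePoint); // 1 or 2 chars
            out.add(s.substring(i, i + length));
            i += length;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample input: "a" followed by U+1F600 (one surrogate pair).
        String text = "a\uD83D\uDE00";
        System.out.println(splitByChar(text).size());      // prints 3
        System.out.println(splitByCodePoint(text).size()); // prints 2
    }
}
```

Iterating with `codePointAt` and `Character.charCount` is one possible approach; whether it fits the filter depends on how offsets are tracked there, which this sketch does not cover.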