WorksApplications/elasticsearch-sudachi

Surrogate pair not properly handled in SudachiSplitFilter

sorami opened this issue · 0 comments

In SudachiSplitFilter, an OOV token is also output character by character in extended mode.

However, the "characters" are handled as a char array, which causes a problem when the text contains surrogate pairs: each half of a pair is emitted as a separate token.

For example, when the input text is "𝑇", there will be 3 tokens:

  1. "𝑇"
  2. String.valueOf("𝑇".toCharArray()[0])
  3. String.valueOf("𝑇".toCharArray()[1])
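
To illustrate the difference, here is a minimal sketch that contrasts splitting by char with splitting by code point. It is not the actual SudachiSplitFilter code; the class and method names (`CodePointSplit`, `splitByChar`, `splitByCodePoint`) are hypothetical, and the sample input U+1D447 is only assumed to be representative of a character outside the BMP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of elasticsearch-sudachi).
class CodePointSplit {

    // Naive split: one token per UTF-16 code unit, as a char-array loop would do.
    // A surrogate pair is torn into two unpaired surrogates.
    static List<String> splitByChar(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            tokens.add(String.valueOf(c));
        }
        return tokens;
    }

    // Code-point-aware split: a surrogate pair stays together as one token.
    static List<String> splitByCodePoint(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);        // reads a full pair if one starts at i
            int len = Character.charCount(cp);   // 1 for BMP, 2 for supplementary
            tokens.add(text.substring(i, i + len));
            i += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String t = "\uD835\uDC47"; // U+1D447, encoded as a surrogate pair
        System.out.println(splitByChar(t).size());      // 2
        System.out.println(splitByCodePoint(t).size()); // 1
    }
}
```

With a code-point-aware loop, the per-character output for a single supplementary character would be just that one character, rather than two broken halves.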

(Possibly a similar problem exists outside this filter too?)