Refactor grapheme cluster segmentation to properly act on clusters with more than 2 codepoints
christianparpart commented
https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules
Specifically, I am interested in correctly segmenting a run of consecutive country flags, where each flag is a pair of Regional Indicator (RI) codepoints.
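To illustrate why flags need more than a two-codepoint lookbehind, here is a minimal sketch of rules GB12/GB13 from UAX #29 only: a break between two RIs is suppressed only when an odd number of RIs precedes the break position. The names `is_regional_indicator` and `flag_boundaries` are hypothetical and not part of the existing API, and every non-RI codepoint is treated as its own cluster for simplicity.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Regional Indicator symbols occupy U+1F1E6..U+1F1FF.
constexpr bool is_regional_indicator(char32_t cp) noexcept
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Hypothetical helper: returns the offsets at which a new cluster starts.
// Only GB12/GB13 (RI pairing) are modelled; all other rules are ignored,
// so every non-RI codepoint starts its own cluster here.
std::vector<std::size_t> flag_boundaries(std::u32string_view text)
{
    std::vector<std::size_t> starts;
    std::size_t ri_run = 0; // consecutive RIs immediately before position i
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        bool const ri = is_regional_indicator(text[i]);
        // Do not break if this RI completes a pair, i.e. an odd number
        // of RIs directly precedes it.
        if (!(ri && ri_run % 2 == 1))
            starts.push_back(i);
        ri_run = ri ? ri_run + 1 : 0;
    }
    return starts;
}
```

The running count `ri_run` is the piece of state that a purely pairwise (previous, next) check cannot provide.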
Also, to make both the current and the future implementation very fast, we should store each codepoint's grapheme break class (CR, LF, L, V, T, LV, LVT, Extend, ZWJ, Control, SpacingMark, Prepend, Extended_Pictographic, RI) as a field in the new codepoint_properties table, so that segmentation needs only a single table lookup per codepoint.
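A rough sketch of what such a field could look like follows. The struct layout, the `char_width` member, and the function name `no_break_by_simple_rules` are assumptions made for illustration and do not reflect the actual codepoint_properties definition. Only the stateless rules GB3, GB4, GB5, and GB9 are covered; everything else is reported as a break.

```cpp
#include <cstdint>

// One byte is enough to hold every grapheme break class.
enum class grapheme_cluster_break : std::uint8_t
{
    Other, CR, LF, Control, Extend, ZWJ, Regional_Indicator, Prepend,
    SpacingMark, L, V, T, LV, LVT, Extended_Pictographic
};

// Hypothetical table entry; the real one would carry more fields.
struct codepoint_properties
{
    std::uint8_t char_width;               // assumed pre-existing field
    grapheme_cluster_break grapheme_break; // the proposed new field
};

static_assert(sizeof(codepoint_properties) == 2, "entry should stay compact");

// Returns true if no boundary may be placed between a and b according to
// GB3 (CR x LF) or GB9 (x Extend | ZWJ). GB4/GB5 force a break around
// controls. Rules that need more context are deliberately left out.
constexpr bool no_break_by_simple_rules(codepoint_properties const& a,
                                        codepoint_properties const& b) noexcept
{
    using G = grapheme_cluster_break;
    auto const is_control = [](G g) {
        return g == G::CR || g == G::LF || g == G::Control;
    };
    if (a.grapheme_break == G::CR && b.grapheme_break == G::LF)
        return true;  // GB3
    if (is_control(a.grapheme_break) || is_control(b.grapheme_break))
        return false; // GB4, GB5
    return b.grapheme_break == G::Extend || b.grapheme_break == G::ZWJ; // GB9
}
```

With the class stored directly in the entry, each rule check becomes an integer comparison on data that was already fetched for the codepoint.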