unicode-rs/unicode-segmentation

Segmentation of combined emojis

RazrFalcon opened this issue · 1 comments

for c in UnicodeSegmentation::graphemes("🏳️‍🌈", true) {
    println!("{}", c);
}

Outputs:

🏳️‍
🌈

But should output:

🏳️‍🌈

Another example: 👮‍♀.

Is this a bug in UnicodeSegmentation, or am I doing this wrong? For my current task this should be a single "character".

We're operating off an old Unicode version (9) where that's not in the tables.

https://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
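For anyone following along, it can help to see which code points the two examples are actually made of. The sketch below uses only the standard library; `code_points` is a hypothetical helper name, not part of unicode-segmentation. The strings are written with escapes so that the invisible variation selector (U+FE0F) and zero width joiner (U+200D) are explicit.

```rust
/// Hypothetical helper: formats each `char` of `s` as a `U+XXXX` label.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // Rainbow flag: white flag, variation selector-16, ZWJ, rainbow.
    let flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";
    // Police officer, ZWJ, female sign.
    let officer = "\u{1F46E}\u{200D}\u{2640}";

    println!("{}", code_points(flag).join(" "));
    println!("{}", code_points(officer).join(" "));
}
```

In both cases the reported split falls right after U+200D, which is where a segmenter needs the table entry for the following code point in order to keep the sequence together.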

Filed #43

That may take a while to fix, but it may be worth updating to Unicode 10 in the interim (which is an easier update than 10 to 11), and will also fix your issue.
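Until the tables are updated, a caller-side workaround could look roughly like the sketch below. It is not how the crate fixes this, and `merge_zwj_clusters` is a hypothetical name: it simply glues any cluster that ends in a zero width joiner (U+200D) onto the cluster that follows it. That is an assumption that only holds for emoji ZWJ sequences; it will over-merge text where a ZWJ is followed by something that should stay separate.

```rust
const ZWJ: char = '\u{200D}';

/// Hypothetical workaround: merges every cluster that ends in ZWJ
/// with the cluster that comes after it.
fn merge_zwj_clusters<'a, I>(clusters: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut merged: Vec<String> = Vec::new();
    let mut join_next = false;
    for cluster in clusters {
        if join_next {
            // The previous cluster ended in ZWJ, so append to it.
            if let Some(last) = merged.last_mut() {
                last.push_str(cluster);
            }
        } else {
            merged.push(cluster.to_string());
        }
        join_next = cluster.ends_with(ZWJ);
    }
    merged
}

fn main() {
    // The two pieces reported in the issue, written with escapes.
    let pieces = ["\u{1F3F3}\u{FE0F}\u{200D}", "\u{1F308}"];
    let merged = merge_zwj_clusters(pieces.iter().copied());
    println!("{} cluster(s)", merged.len());
}
```

In real use the input would be the iterator returned by `UnicodeSegmentation::graphemes(text, true)` rather than a hard-coded array.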