Segmentation of combined emojis
RazrFalcon opened this issue · 1 comments
RazrFalcon commented
for c in UnicodeSegmentation::graphemes("🏳️‍🌈", true) {
    println!("{}", c);
}
Outputs:
🏳️‍
🌈
🏳️‍
🌈
But should output:
🏳️‍🌈
🏳️‍🌈
Another example: 👮‍♀.
Is it a UnicodeSegmentation bug or am I doing this wrong? For my current task this should be a single "character".
Manishearth commented
We're operating off an old Unicode version (9) where that's not in the tables.
https://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
Filed #43
That may take a while to fix, but in the interim it may be worth updating to Unicode 10 (which is an easier update than 10 to 11), and that will also fix your issue.
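For anyone hitting the same thing, here is a rough std-only sketch of what such a sequence looks like at the code point level. It assumes the example is the rainbow flag ZWJ sequence (U+1F3F3 U+FE0F U+200D U+1F308). The helpers `code_points` and `naive_zwj_clusters` are hypothetical names made up for this illustration. The grouping rule is deliberately naive: it is not the UAX #29 algorithm and not what unicode-segmentation does internally. It only shows why the parts on either side of a ZERO WIDTH JOINER are expected to stay together.

```rust
/// Hypothetical helper: formats each code point of `s` as "U+XXXX".
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Hypothetical, deliberately naive grouping (NOT UAX #29):
/// a char is attached to the previous cluster if it is ZWJ or VS16,
/// or if the char right before it was ZWJ. Anything else starts a new cluster.
fn naive_zwj_clusters(s: &str) -> Vec<String> {
    const ZWJ: char = '\u{200D}'; // ZERO WIDTH JOINER
    const VS16: char = '\u{FE0F}'; // VARIATION SELECTOR-16

    let mut clusters: Vec<String> = Vec::new();
    let mut prev: Option<char> = None;

    for c in s.chars() {
        let attach = c == ZWJ || c == VS16 || prev == Some(ZWJ);
        if attach && !clusters.is_empty() {
            let last = clusters.len() - 1;
            clusters[last].push(c);
        } else {
            clusters.push(c.to_string());
        }
        prev = Some(c);
    }
    clusters
}

fn main() {
    // Assumed input: the rainbow flag, written with escapes so the
    // individual code points are visible in the source.
    let flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";

    // Prints U+1F3F3, U+FE0F, U+200D, U+1F308, one per line.
    for cp in code_points(flag) {
        println!("{}", cp);
    }

    // Two flags back to back come out as two clusters under the naive rule.
    let doubled = format!("{0}{0}", flag);
    for cluster in naive_zwj_clusters(&doubled) {
        println!("{}", cluster);
    }
}
```

For real text, the grapheme break tables in the crate are still the right tool. This sketch ignores every other rule (combining marks, Hangul, regional indicators, and so on).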