charWidth gives incorrect result for emoji
Emoji are supposed to be displayed two columns wide, apparently since Unicode 9. Here, however, they are treated as one column wide.
Here is a list of emoji in Unicode 14 (https://unicode.org/emoji/charts/full-emoji-list.html). Things can get pretty ugly with zero-width combiners, but we can probably improve on the current situation.
cf. https://bugs.launchpad.net/ubuntu/+source/gnome-terminal/+bug/1665140
https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9
Testing:
🙃👿
abcd
Our function for retrieving character widths is pretty simple; it relies on ranges.
Is there a dedicated emoji range we can test for?
Unfortunately, it looks like the answer is no. Emoji (ignoring zero-width joiners for now) are defined as the code points listed in this file, a set which is highly non-contiguous.
The blocks that have some emoji in them seem to be:

```
'\x1200'  to '\x2328'  -> 1
'\x232B'  to '\x2E31'  -> 1
'\x1D000' to '\x1F1FF' -> 1
'\x1F200' to '\x1F251' -> 2 -- has emoji, but they are already 2 columns wide
'\x1F300' to '\x1F773' -> 1
```
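For context, a range-based lookup along these lines might look like the following minimal sketch. The table here is a toy subset for illustration, not doclayout's actual data:

```haskell
-- Illustrative range-based width lookup; the table below is a toy
-- subset, not doclayout's real table.
charWidth :: Char -> Int
charWidth c = go widthTable
  where
    go [] = 1
    go ((lo, hi, w) : rest)
      | c < lo    = 1          -- table is sorted, so no later range can match
      | c <= hi   = w
      | otherwise = go rest
    widthTable =
      [ ('\x1100',  '\x115F',  2)  -- Hangul Jamo (wide)
      , ('\x1F200', '\x1F251', 2)  -- enclosed ideographic supplement
      ]
```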
Perhaps the most extensible solution is to write a parser for these specification files, have it generate a list of emoji ranges, and then use Template Haskell to generate an `isEmoji :: Char -> Bool` function.
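As a sketch of that idea (all names here are hypothetical, and the list of parsed ranges is assumed to come from the specification-file parser), the Template Haskell side could look something like:

```haskell
{-# LANGUAGE TemplateHaskell #-}
-- Hypothetical sketch: splice an isEmoji function from a list of
-- inclusive (lo, hi) code point ranges computed at compile time.
module EmojiTH (mkIsEmoji) where

import Language.Haskell.TH

mkIsEmoji :: [(Char, Char)] -> Q [Dec]
mkIsEmoji ranges =
  [d| isEmoji :: Char -> Bool
      isEmoji c = any (\(lo, hi) -> lo <= c && c <= hi) ranges
    |]
```

A caller would then write `$(mkIsEmoji parsedRanges)` in some module, where `parsedRanges` is produced by the parser.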
There is also a list of code points which have a width of 1, but become emoji (and hence width 2) when followed by the Unicode variation selector 16 (U+FE0F). Dealing with these and zero-width joiners may not be possible with a range-based approach.
Well, the data should already be in my emojis package; we could depend on that. Unfortunately, as you point out, lots of emojis use multiple code points (not just FE0F either; look at the national flags, for instance).
Oh, very nice! Yes, depending on that would work. As a first pass, we could just filter the emoji list for those consisting of a single code point and test against that.
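As a sketch of that first pass (assuming the emojis package exposes its emoji list as `Text` values; the exact export name `baseEmojis` is an assumption and should be checked against the actual API):

```haskell
import qualified Data.Set as Set
import qualified Data.Text as T
import Text.Emoji (baseEmojis)  -- assumed export; check the actual API

-- Emoji that consist of exactly one code point.
singleCodePointEmojis :: Set.Set Char
singleCodePointEmojis =
  Set.fromList [T.head t | t <- baseEmojis, T.length t == 1]

isEmoji :: Char -> Bool
isEmoji = (`Set.member` singleCodePointEmojis)
```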
The more general problem is that it seems emoji (and character width in general) cannot be detected using a character-by-character approach, and at least some state needs to be maintained. Is dealing with that within the scope of doclayout?
Sure, I think it's in scope. We have a function `realLength`, which does a left fold over characters (i.e. code points). This could be modified to keep track of state from earlier characters. (In case you want to take a look.)
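A minimal sketch of that stateful fold, with a toy width table and a toy set of VS16-upgradable code points (both are assumptions for illustration, not doclayout's real data):

```haskell
import qualified Data.Set as Set

-- Toy set of code points that go from width 1 to width 2 when
-- followed by variation selector 16 (illustrative only).
vs16Upgradable :: Set.Set Char
vs16Upgradable = Set.fromList ['\x2764', '\x2600']  -- heavy heart, sun

baseWidth :: Char -> Int
baseWidth c
  | c >= '\x1F300' && c <= '\x1F773' = 2  -- toy emoji block
  | otherwise                        = 1

-- Left fold over code points, remembering the previous character so
-- that U+FE0F can retroactively widen it.
realLength :: String -> Int
realLength = snd . foldl step (Nothing, 0)
  where
    step (prev, n) c
      | c == '\xFE0F' =
          case prev of
            Just p | p `Set.member` vs16Upgradable -> (Nothing, n + 1)
            _                                      -> (Nothing, n)
      | otherwise = (Just c, n + baseWidth c)
```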
What are your opinions and policies on pulling in extra dependencies? Emoji recognition seems like it might be a good fit for a bytestring trie.
Edit: It's probably not worth pulling in the dependency. We can get similar results using nested `Map`s.
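A sketch of that nested-`Map` approach: a small trie over code points, where a lookup walks the input and reports the longest emoji sequence matched at the front (all names illustrative):

```haskell
import qualified Data.Map.Strict as Map

-- A trie over code points built from nested Maps.
data Trie = Trie
  { isEnd    :: Bool
  , children :: Map.Map Char Trie
  }

emptyTrie :: Trie
emptyTrie = Trie False Map.empty

insertSeq :: String -> Trie -> Trie
insertSeq []       (Trie _ kids) = Trie True kids
insertSeq (c : cs) (Trie e kids) =
  Trie e (Map.insert c (insertSeq cs child) kids)
  where
    child = Map.findWithDefault emptyTrie c kids

-- Length of the longest stored sequence at the front of the input.
matchLen :: Trie -> String -> Maybe Int
matchLen t0 = go t0 0 Nothing
  where
    go (Trie e kids) n best s =
      let best' = if e then Just n else best
      in case s of
           []       -> best'
           (c : cs) -> case Map.lookup c kids of
                         Nothing -> best'
                         Just t  -> go t (n + 1) best' cs
```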
An unrelated question about the emojis library: is there a specific reason you get the emoji list from gemoji rather than from the Unicode specification itself, or is it just simpler?
Partly because it's easier and partly because we want to support the standard aliases GitHub uses, which I don't think are found in the Unicode spec.
Whatever changes we make, we have to make sure performance is decent, because this width calculation gets run on every literal doclayout processes. Nested Maps might work well enough.