jgm/doclayout

charWidth gives incorrect result for emoji

Closed this issue · 12 comments

Emoji are supposed to be displayed as 2 characters wide, apparently since Unicode 9. However, here they are treated as 1 character wide.

Here is a list of emoji in Unicode 14 (https://unicode.org/emoji/charts/full-emoji-list.html). Things can get pretty ugly with zero-width joiners, but we can probably improve on the current situation.

cf. https://bugs.launchpad.net/ubuntu/+source/gnome-terminal/+bug/1665140
https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9

jgm commented

Testing:

🙃👿
abcd

jgm commented

Our function for retrieving character widths is pretty simple; it relies on ranges.
Is there a dedicated emoji range we can test for?

Unfortunately it looks like the answer is no. Emoji (ignoring zero-width joiners for now) are defined as the code points listed in this file, which is highly non-contiguous.

The blocks that have some emoji in them seem to be:

'\x1200' to '\x2328'   -> 1
'\x232B' to '\x2E31'   -> 1
'\x1D000' to '\x1F1FF' -> 1
'\x1F200' to '\x1F251' -> 2  -- has emoji, but are already 2 characters wide
'\x1F300' to '\x1F773' -> 1

Perhaps the most extensible solution is to write a parser for these specification files, have them generate a list of emoji ranges, and then use Template Haskell to generate an isEmoji :: Char -> Bool function.
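To make the range idea concrete, here is a minimal sketch of a range-based isEmoji check. It uses a plain hand-written list rather than Template Haskell, and the three ranges are only a small hand-picked sample, not the full set a generator would produce from the specification files. The names emojiRanges and charWidthSketch are hypothetical and are not part of doclayout.

```haskell
import Data.Char (ord)

-- Hypothetical sample of inclusive code point ranges treated as wide emoji.
-- A real table would be generated from the Unicode data files.
emojiRanges :: [(Int, Int)]
emojiRanges =
  [ (0x1F442, 0x1F4FC)
  , (0x1F600, 0x1F64F)
  , (0x1F680, 0x1F6C5)
  ]

-- True if the character falls inside any of the sample ranges.
isEmoji :: Char -> Bool
isEmoji c = any inRange emojiRanges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi

-- Simplified width function: 2 for sample emoji, 1 for everything else.
-- (The real charWidth also handles other wide and zero-width characters.)
charWidthSketch :: Char -> Int
charWidthSketch c
  | isEmoji c = 2
  | otherwise = 1
```

With a long generated table, a linear scan like this would likely be too slow, so a sorted structure with binary search (or generated guards) would be the natural next step.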

There is also a list of code points which have a width of 1, but become emoji (and hence width 2) when followed by the Unicode variation selector 16 (U+FE0F). Dealing with these and zero-width joiners may not be possible with a range-based approach.

jgm commented

Well, the data should already be in my emojis package; we could depend on that. Unfortunately, as you point out, lots of emojis use multiple code points (not just FE0F either; look at the national flags, for instance).

Oh, very nice! Yes, depending on that would work. As a first pass, we could just filter the emoji list for those consisting of a single code point and test against that.
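A rough sketch of that first pass, assuming the emoji are available as a list of strings. The list sampleEmoji below is a hypothetical stand-in for whatever the emojis package provides, and the function names are made up for illustration.

```haskell
import qualified Data.Set as Set

-- Hypothetical stand-in for the full emoji list: two single code point
-- emoji, one flag (two regional indicators), and one VS16 sequence.
sampleEmoji :: [String]
sampleEmoji =
  [ "\x1F643"
  , "\x1F47F"
  , "\x1F1EB\x1F1F7"
  , "\x2764\xFE0F"
  ]

-- Keep only the emoji that consist of exactly one code point.
singleCodePointEmoji :: [String] -> Set.Set Char
singleCodePointEmoji es = Set.fromList [c | [c] <- es]

-- Membership test against the filtered set.
isSingleEmoji :: Char -> Bool
isSingleEmoji c = c `Set.member` singleCodePointEmoji sampleEmoji
```

Multi-code-point entries such as the flag and the VS16 sequence are simply dropped here, which is why this can only be a first pass.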

The more general problem is that it seems emoji (and character width in general) cannot be detected using a character-by-character approach, and at least some state needs to be maintained. Is dealing with that within the scope of doclayout?

jgm commented

Sure, I think it's in scope. We have a function realLength which does a left fold over characters (i.e. code points). This could be modified to keep track of state from earlier characters. (In case you want to take a look.)
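As an illustration of what a stateful fold might look like, here is a simplified sketch. It is not the actual realLength; the names St and realLengthSketch are hypothetical, and the per-character width function is passed in as a parameter. It assumes that U+FE0F widens a preceding width-1 character to 2 and that a character following a zero-width joiner (U+200D) adds no width. Flags and skin-tone modifiers are not handled.

```haskell
import Data.List (foldl')

-- Fold state: total width so far, width of the last visible character,
-- and whether the previous code point was a zero-width joiner.
data St = St !Int !Int !Bool

realLengthSketch :: (Char -> Int) -> String -> Int
realLengthSketch baseWidth = total . foldl' step (St 0 0 False)
  where
    total (St t _ _) = t

    step (St t lw joined) c
      -- Variation selector 16: widen a preceding width-1 character to 2.
      | c == '\xFE0F' =
          if lw == 1 then St (t + 1) 2 False else St t lw False
      -- Zero-width joiner: adds no width; only joins if something precedes it.
      | c == '\x200D' = St t lw (lw > 0)
      -- Character joined to the previous glyph: assumed to add no width.
      | joined = St t lw False
      -- Ordinary character: add its base width.
      | otherwise = let w = baseWidth c in St (t + w) w False
```

For example, with a base width of 1 for U+2764, the sequence U+2764 U+FE0F comes out as width 2, and two width-2 emoji joined by U+200D come out as width 2 in total. Terminals differ in how they actually render such sequences, so this is only one possible model.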

What are your opinions and policies on pulling in extra dependencies? Emoji recognition seems like it might be a good fit for a bytestring trie.

Edit: It's probably not worth pulling in the dependency. We can get similar results using nested Maps.
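For reference, a minimal sketch of what "nested Maps" could look like: a small trie keyed on Char, built from Data.Map, with a longest-prefix lookup. All names here (Trie, insertSeq, trieFromList, longestMatch) are hypothetical and only illustrate the idea.

```haskell
import qualified Data.Map.Strict as M

-- A trie node: whether a sequence ends here, plus children keyed by Char.
data Trie = Trie Bool (M.Map Char Trie)

emptyTrie :: Trie
emptyTrie = Trie False M.empty

-- Insert one emoji sequence (a list of code points) into the trie.
insertSeq :: String -> Trie -> Trie
insertSeq []       (Trie _ m) = Trie True m
insertSeq (c : cs) (Trie e m) = Trie e (M.insert c (insertSeq cs child) m)
  where
    child = M.findWithDefault emptyTrie c m

trieFromList :: [String] -> Trie
trieFromList = foldr insertSeq emptyTrie

-- Number of code points in the longest known sequence that is a prefix
-- of the input, or Nothing if no sequence matches.
longestMatch :: Trie -> String -> Maybe Int
longestMatch = go 0 Nothing
  where
    go n best (Trie e m) input =
      let best' = if e then Just n else best
      in case input of
           (c : cs) | Just t <- M.lookup c m -> go (n + 1) best' t cs
           _ -> best'
```

A width function could call longestMatch at each position, count a match as one width-2 glyph, and skip the matched code points. Whether this is fast enough for every literal would need measuring.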

An unrelated question about the emojis library: is there a specific reason you get the emoji list from gemoji rather than from the unicode specification itself, or is it just simpler?

jgm commented

Partly because it's easier and partly because we want to support the standard aliases GitHub uses, which I don't think are found in the Unicode spec.

jgm commented

Whatever changes we make, we have to make sure performance is decent, because this width calculation gets run on every literal doclayout processes. Nested Maps might work well enough.

jgm commented

Closed by PR #4.