charWidth gives incorrect result for emoji

Question

Closed this issue 3 years ago · 12 comments

Emoji are supposed to be displayed as 2 characters wide, apparently since Unicode 9. However, here they are treated as 1 character wide.

Here is a list of emoji in Unicode 14 (https://unicode.org/emoji/charts/full-emoji-list.html). Things can get pretty ugly with zero-width joiners, but we can probably improve on the current situation.

cf. https://bugs.launchpad.net/ubuntu/+source/gnome-terminal/+bug/1665140
https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9

Answer 1 · 2021-10-04T15:40:27.000Z

Testing:

🙃👿
abcd

Answer 2 · 2021-10-04T15:44:35.000Z

Our function for retrieving character widths is pretty simple; it relies on ranges.
Is there a dedicated emoji range we can test for?
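
For readers unfamiliar with the approach, a range-based width lookup can be sketched roughly as follows. This is not doclayout's actual table: the name `widthByRange` and the three entries are illustrative assumptions only.

```haskell
-- Hypothetical sketch of a range-based width lookup (not doclayout's real table).
import Data.Char (ord)

-- Each entry is (first code point, last code point, display width).
-- The entries below are a small illustrative sample.
widthRanges :: [(Int, Int, Int)]
widthRanges =
  [ (0x0300, 0x036F, 0)  -- combining diacritical marks
  , (0x1100, 0x115F, 2)  -- Hangul Jamo initial consonants
  , (0x4E00, 0x9FFF, 2)  -- CJK unified ideographs
  ]

-- Return the width of the first range containing the character,
-- defaulting to 1 when no range matches.
widthByRange :: Char -> Int
widthByRange c =
  case [w | (lo, hi, w) <- widthRanges, lo <= n, n <= hi] of
    (w : _) -> w
    []      -> 1
  where
    n = ord c
```

A table like this only works well when the characters of interest form a few contiguous blocks, which is why a dedicated emoji range would be convenient.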

Answer 3 · 2021-10-04T21:47:48.000Z

Unfortunately it looks like the answer is no. Emoji (ignoring zero-width joiners for now) are defined as the code points listed in this file, which is highly non-contiguous.

Answer 4 · 2021-10-05T03:02:56.000Z

The blocks that have some emoji in them seem to be:

'\x1200' to '\x2328'   -> 1
'\x232B' to '\x2E31'   -> 1
'\x1D000' to '\x1F1FF' -> 1
'\x1F200' to '\x1F251' -> 2  -- has emoji, but are already 2 characters wide
'\x1F300' to '\x1F773' -> 1

Perhaps the most extensible solution is to write a parser for these specification files, have them generate a list of emoji ranges, and then use Template Haskell to generate an isEmoji :: Char -> Bool function.

There is also a list of code points which have a width of 1, but become emoji (and hence width 2) when followed by the Unicode variation selector 16 (U+FE0F). Dealing with these and zero-width joiners may not be possible with a range-based approach.
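
A minimal sketch of the parsing step, without the Template Haskell part, might look like this. It assumes the specification file uses the `code point or range ; property # comment` line layout of Unicode's emoji-data.txt. The names `parseLine`, `parseRange`, and `isEmojiIn` are hypothetical, and which property to filter on is left as a parameter.

```haskell
-- Hypothetical sketch: extract code point ranges for one property from
-- lines shaped like "1F600..1F64F ; Emoji_Presentation # comment".
import Data.Char (isSpace, ord)
import Numeric (readHex)

-- Parse one line; Nothing for comments, blanks, or other properties.
parseLine :: String -> String -> Maybe (Int, Int)
parseLine property line =
  case break (== ';') (takeWhile (/= '#') line) of
    (field, ';' : prop)
      | trim prop == property -> parseRange (trim field)
    _ -> Nothing

-- Accept either a single code point ("0023") or a range ("1F600..1F64F").
parseRange :: String -> Maybe (Int, Int)
parseRange s =
  case break (== '.') s of
    (lo, "")             -> (\n -> (n, n)) <$> hex lo
    (lo, '.' : '.' : hi) -> (,) <$> hex lo <*> hex hi
    _                    -> Nothing

-- Parse a complete hexadecimal string.
hex :: String -> Maybe Int
hex s = case readHex s of
  [(n, "")] -> Just n
  _         -> Nothing

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = f . f
  where
    f = reverse . dropWhile isSpace

-- Test a character against a list of parsed ranges.
isEmojiIn :: [(Int, Int)] -> Char -> Bool
isEmojiIn ranges c = any (\(lo, hi) -> lo <= n && n <= hi) ranges
  where
    n = ord c
```

The resulting range list could then be embedded at compile time or checked in as generated source. Either way, it covers single code points only, not the variation-selector cases described above.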

Answer 5 · 2021-10-05T04:31:53.000Z

Well, the data should already be in my emojis package; we could depend on that. Unfortunately, as you point out, lots of emojis use multiple code points (not just FE0F either; look at the national flags, for instance).

Answer 6 · 2021-10-05T05:10:45.000Z

Oh, very nice! Yes, depending on that would work. As a first pass, we could just filter the emoji list for those consisting of a single code point and test against that.
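
As a rough sketch of that first pass, assuming the emoji list is available as a plain list of strings (how the emojis package actually exposes its data is not shown here, and `singleCodePointEmoji` and `firstPassWidth` are hypothetical names):

```haskell
-- Hypothetical sketch: keep only single-code-point emoji and use them
-- to override an existing per-character width function.
import qualified Data.Set as S

-- Collect every emoji that consists of exactly one code point.
singleCodePointEmoji :: [String] -> S.Set Char
singleCodePointEmoji es = S.fromList [c | [c] <- es]

-- Width 2 for known single-code-point emoji, otherwise defer to the
-- fallback width function supplied by the caller.
firstPassWidth :: S.Set Char -> (Char -> Int) -> Char -> Int
firstPassWidth emojiSet fallback c
  | c `S.member` emojiSet = 2
  | otherwise             = fallback c
```

Multi-code-point entries such as flags are silently dropped by the list comprehension, so they would keep whatever width the fallback assigns to each of their code points.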

The more general problem is that it seems emoji (and character width in general) cannot be detected using a character-by-character approach, and at least some state needs to be maintained. Is dealing with that within the scope of doclayout?

Answer 7 · 2021-10-05T05:56:02.000Z

Sure, I think it's in scope. We have a function realLength which does a left fold over characters (i.e. code points). This could be modified to keep track of state from earlier characters. (In case you want to take a look.)
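
To illustrate what a stateful fold could look like, here is a minimal sketch. It is not the actual realLength: `WidthState`, `step`, and `statefulLength` are hypothetical names, and the per-character width function is passed in as a parameter. It simplifies by letting U+FE0F promote any preceding width-1 character, and it ignores skin-tone modifiers and regional indicator pairs.

```haskell
-- Hypothetical sketch of a width fold that remembers the previous character.
import Data.List (foldl')

data WidthState = WidthState
  { total     :: !Int   -- width accumulated so far
  , lastWidth :: !Int   -- width contributed by the previous visible character
  , afterZWJ  :: !Bool  -- True if the previous code point was U+200D
  }

step :: (Char -> Int) -> WidthState -> Char -> WidthState
step baseWidth (WidthState t lw zwj) c
  -- Variation selector 16: widen a preceding width-1 character to 2.
  | c == '\xFE0F' =
      if lw == 1 then WidthState (t + 1) 2 False else WidthState t lw False
  -- Zero-width joiner: adds nothing, but marks the next character as joined.
  | c == '\x200D' = WidthState t lw True
  -- A character joined to the previous one adds no extra width.
  | zwj = WidthState t lw False
  -- Ordinary case: add the character's own width.
  | otherwise = let w = baseWidth c in WidthState (t + w) w False

statefulLength :: (Char -> Int) -> String -> Int
statefulLength baseWidth = total . foldl' (step baseWidth) (WidthState 0 0 False)
```

Because the state is a small strict record, the fold stays a single pass over the code points.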

Answer 8 · 2021-10-05T06:46:25.000Z

What are your opinions and policies on pulling in extra dependencies? Emoji recognition seems like it might be a good fit for a bytestring trie.

Edit: It's probably not worth pulling in the dependency. We can get similar results using nested Maps.
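
A minimal sketch of the nested-Maps idea, using Data.Map from the containers package. The names `Trie`, `insertSeq`, and `longestMatch` are hypothetical, and this is only one possible shape for such a structure.

```haskell
-- Hypothetical sketch: a trie over code points built from nested Maps.
import qualified Data.Map.Strict as M

data Trie = Trie
  { isEnd    :: Bool             -- True if a stored sequence ends at this node
  , children :: M.Map Char Trie  -- next code point -> subtrie
  }

emptyTrie :: Trie
emptyTrie = Trie False M.empty

-- Insert one emoji sequence (a list of code points) into the trie.
insertSeq :: String -> Trie -> Trie
insertSeq []       (Trie _ m) = Trie True m
insertSeq (c : cs) (Trie e m) =
  Trie e (M.insert c (insertSeq cs (M.findWithDefault emptyTrie c m)) m)

-- Number of code points in the longest stored sequence that is a prefix
-- of the input, or Nothing if no stored sequence matches.
longestMatch :: Trie -> String -> Maybe Int
longestMatch = go 0 Nothing
  where
    go n best (Trie e m) input =
      let best' = if e then Just n else best
      in case input of
           (c : rest) | Just t <- M.lookup c m -> go (n + 1) best' t rest
           _ -> best'
```

A width function could call `longestMatch` at each position, count a match as width 2, and skip the matched code points. Whether that is fast enough would need measuring.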

Answer 9 · 2021-10-05T10:38:14.000Z

An unrelated question about the emojis library: is there a specific reason you get the emoji list from gemoji rather than from the unicode specification itself, or is it just simpler?

Answer 10 · 2021-10-05T15:59:52.000Z

Partly because it's easier and partly because we want to support the standard aliases GitHub uses, which I don't think are found in the Unicode spec.

Answer 11 · 2021-10-05T16:38:51.000Z

Whatever changes we make, we have to make sure performance is decent, because this width calculation gets run on every literal doclayout processes. Nested Maps might work well enough.

Answer 12 · 2021-10-10T20:43:53.000Z

Closed by PR #4.