mbutterick/quad

fall back to system fonts for missing characters, like emoji

Closed this issue · 18 comments

(currently writing a small document and got a tofu where there should be an emoji)

Please post the source code that produces the wrong output.

#lang quadwriter/markdown

 😎

I thought it was just that there was no support for this character; I don't need it at all!

You can specify your own fonts, but that won’t solve this problem. Emoji are not included in most fonts. In general, the PDF compiler needs to a) notice that it needs a character that isn’t in the current font and b) fall back to the system font. For now I suggest you resort to the good old fashioned ASCII emoticons ;)

When you said “tofu” I thought you meant an actual emoji that looks like tofu. I’ve now learned that “tofu” describes the box-shaped missing-character .notdef glyph.

One can find out if the glyph is included in the font by reading the cmap table in the font.

One can display a missing glyph (incl emoji) by keeping a giant font on hand for such substitutions, for instance Noto.

The detection needs to happen during the PDF render, where the underlying rendering commands might look like so:

[font doc "foo-font"]
[text doc "string to be printed" 100 100]

These commands assume that the characters in the "string to be printed" all correspond to glyphs in foo-font. But that might not be so:

[font doc "foo-font"]
[text doc "emoji 😎 to be printed" 100 100]

To implement font fallback, the text command would have to

  1. Query the cmap of foo-font to see if glyphs exist that correspond to each character. (This assumes conventional behavior by the font. For instance, it’s possible that none of these individual characters correspond to glyphs, but that, say, each word in the string gets shaped into a ligature by the gsub table. But this is too moronic to contemplate.)

  2. If a glyph exists for each character, process the text command the same as usual.

  3. If any glyph is missing, separate the string into multiple smaller strings and process the missing characters individually with a fallback font, so that the above command becomes something like:

[font doc "foo-font"]
[text doc "emoji" 100 100]
[font doc "emoji-fallback-font"]
[text doc "😎" 1xx 100]
[font doc "foo-font"]
[text doc "to be printed" 1xx 100]
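
The splitting step can be sketched in a few lines. This is only an illustration in Python, not quad's Racket code: `split_by_coverage` is a hypothetical helper, and the font's cmap is modeled as a plain set of code points rather than read from a real font file.

```python
def split_by_coverage(text, primary_font, primary_cmap, fallback_font):
    """Split text into (font_name, substring) runs.

    primary_cmap is a set of code points the primary font is assumed
    to cover (a stand-in for a real cmap lookup).
    """
    runs = []
    for ch in text:
        font = primary_font if ord(ch) in primary_cmap else fallback_font
        if runs and runs[-1][0] == font:
            # Same font as the previous character: extend the current run.
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs


# Assumption for the example: foo-font covers only printable ASCII.
ascii_cmap = set(range(0x20, 0x7F))
runs = split_by_coverage(
    "emoji \U0001F60E to be printed", "foo-font", ascii_cmap, "emoji-fallback-font"
)
```

Each run would then become one font command followed by one text command. Note that in this sketch the spaces around the emoji stay with the primary font.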

Actually as far as quad is concerned, this glyph-checking needs to happen even sooner, because it affects layout.

Having said all that, maybe emojis are indeed a special case, because they are pretty much guaranteed not to be supported by user fonts, so maybe they should just be picked out from the input stream and formatted specially.

I looked at both Noto emoji fonts (color and monochrome). The color font is apparently not made of outlines but rather little color PNG images that my font library doesnโ€™t know how to handle. The mono font is missing some essential TTF tables and has also apparently been abandoned.

In sum, if there is an OFL-licensed, outline-based monochrome emoji font out there, I think I can at least fix the emoji part. If someone wants to post a link, great. Finding this font is not a priority homework assignment for me.

just for the record, the only font I found that seems to have an outline-based fallback was https://github.com/eosrei/twemoji-color-font (which is based on https://github.com/twitter/twemoji). the outline font I seem to have installed is Symbola, but that one restricts its commercial usage.

but yeah, this is not a priority for me either! sorry about the tofu confusion from earlier.

Actually, I tried just regenerating the Noto mono emoji font from source and was able to make it work. Now I need a Racket regular expression that detects emoji.

(regexp-match #rx"[\U1f600-\U1f64f]" "😁")

Now I just need a list of emoji codepoint ranges.

apparently one also needs to take grapheme clusters into account for emoji like 👨‍👩‍👦‍👦:

https://stackoverflow.com/questions/43146528/how-to-extract-all-the-emojis-from-text

although in text-mode they don't seem to work -- 👨‍👩‍👦‍👦 & 🙅🏽 in emacs:

[image]

so this can be ignored.

https://en.wikipedia.org/wiki/Emoji#Unicode_blocks seems to have the ranges!

The authoritative source is the Unicode consortium, which has published the v12.0 emoji list here. I understand how to handle the first set, the ones that have a single codepoint:

231A..231B    ; Basic_Emoji              ; watch                                                          #  1.1  [2] (⌚..⌛)
23E9..23EC    ; Basic_Emoji              ; fast-forward button                                            #  6.0  [4] (⏩..⏬)
23F0          ; Basic_Emoji              ; alarm clock                                                    #  6.0  [1] (⏰)
23F3          ; Basic_Emoji              ; hourglass not done                                             #  6.0  [1] (⏳)

But then we have the ones with variation selectors:

00A9 FE0F     ; Basic_Emoji              ; copyright                                                      #  3.2  [1] (©️)
00AE FE0F     ; Basic_Emoji              ; registered                                                     #  3.2  [1] (®️)
203C FE0F     ; Basic_Emoji              ; double exclamation mark                                        #  3.2  [1] (‼️)
2049 FE0F     ; Basic_Emoji              ; exclamation question mark                                      #  3.2  [1] (⁉️)
2122 FE0F     ; Basic_Emoji              ; trade mark                                                     #  3.2  [1] (™️)

I suppose one is allowed to ignore the variation selector and just show the default version of the emoji (given by the first codepoint in the list)? So a pattern that matches the copyright emoji would be x00A9 optionally followed by xFE0F, which would just be simplified to x00A9 alone?
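
If that simplification is acceptable, the pattern is short. The following is a Python sketch for illustration only (not the Racket code quad would use). The names `COPYRIGHT` and `strip_variation_selectors` are made up for the example. U+FE0F is VARIATION SELECTOR-16, which requests emoji-style presentation of the preceding character.

```python
import re

VS16 = "\uFE0F"  # VARIATION SELECTOR-16

# The copyright sign, optionally followed by the variation selector.
COPYRIGHT = re.compile("\u00A9\uFE0F?")


def strip_variation_selectors(text):
    """Drop VS16 so each emoji is represented by its base code point alone.

    Assumption: showing the default presentation of the base character
    is good enough, as proposed above.
    """
    return text.replace(VS16, "")
```

After stripping, a lookup keyed on the first codepoint alone would match both forms of the character.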

Are you regretting asking about emoji? 😀 In typesetting, one must always be careful what one wishes for …

I made a little #lang that treats the text of the Unicode specification as the source code for an emoji? function. I still need to sort out what to do about the modifier bytes.
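
For readers who want to see the idea without the #lang machinery, here is a rough Python sketch of turning lines in the format quoted above into a predicate. It is not the actual implementation. `parse_emoji_ranges` and `make_is_emoji` are hypothetical names, the sample data is trimmed from the lines quoted earlier, and entries with more than one codepoint (such as the variation-selector ones) are simply skipped.

```python
def parse_emoji_ranges(data):
    """Return (low, high) code point ranges for single-code-point entries."""
    ranges = []
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not line:
            continue
        field = line.split(";", 1)[0].strip()  # first field: code point(s)
        if " " in field:
            # Multi-code-point sequence, e.g. "00A9 FE0F": not handled here.
            continue
        low, _, high = field.partition("..")
        ranges.append((int(low, 16), int(high or low, 16)))
    return ranges


def make_is_emoji(ranges):
    """Build a predicate that tests one character against the ranges."""
    def is_emoji(ch):
        cp = ord(ch)
        return any(low <= cp <= high for low, high in ranges)
    return is_emoji


SAMPLE = """\
231A..231B    ; Basic_Emoji ; watch               #  1.1  [2]
23E9..23EC    ; Basic_Emoji ; fast-forward button #  6.0  [4]
23F0          ; Basic_Emoji ; alarm clock         #  6.0  [1]
00A9 FE0F     ; Basic_Emoji ; copyright           #  3.2  [1]
"""

is_emoji = make_is_emoji(parse_emoji_ranges(SAMPLE))
```

A linear scan over the ranges is fine for a sketch. A real implementation would more likely precompute a set or a sorted table.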

As I think about it longer, though emoji is a special case of the missing-glyph problem, they should all be handled the same way:

  1. Early in the typesetting process, quadwriter has a list of every individual character and the font that will be used for each.

  2. That is the best moment to go through and check that each of those characters is in the selected font.

  3. If the glyph exists in the font, skip step (4).

  4. If the glyph doesn't exist in the font, change the font to a fallback font (which would be one of two fonts, an emoji font or a regular font, depending on the character that needs substitution).

  5. Then the rest of the quadwriter process can happen normally.
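
The steps above can be sketched roughly as follows. This is illustrative Python, not quadwriter's code. `apply_fallbacks`, the `coverage` dictionary (font name to set of covered code points), and the two fallback font names are all assumptions made for the example. The emoji test reuses the single range from the regular expression earlier in the thread.

```python
def apply_fallbacks(chars_with_fonts, coverage, is_emoji,
                    emoji_fallback="fallback-emoji",
                    text_fallback="fallback-text"):
    """Return new (char, font) pairs, re-fonting characters the font lacks.

    coverage maps a font name to the set of code points it is assumed
    to contain (a stand-in for querying each font's cmap).
    """
    result = []
    for ch, font in chars_with_fonts:
        if ord(ch) not in coverage.get(font, set()):
            # Step 4: pick one of the two fallback fonts.
            font = emoji_fallback if is_emoji(ch) else text_fallback
        result.append((ch, font))
    return result


# Assumptions for the example: foo-font covers printable ASCII only, and
# "emoji" means the U+1F600..U+1F64F range used in the regexp above.
coverage = {"foo-font": set(range(0x20, 0x7F))}
in_emoticons_range = lambda ch: 0x1F600 <= ord(ch) <= 0x1F64F

pairs = [(ch, "foo-font") for ch in "a\U0001F60E\u03A9"]
fixed = apply_fallbacks(pairs, coverage, in_emoticons_range)
```

Here the letter keeps foo-font, the emoji moves to the emoji fallback, and the Greek capital omega moves to the regular fallback. Because the substitution happens before layout, later stages see the final font for every character.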

this project seems to do something similar in emacs (https://github.com/rolandwalker/unicode-fonts). maybe it can be useful as a reference.

This should work now. I haven’t tested the fallback-font mechanism with every possible glyph, however, so if you find glyphs that don’t work, please post another issue.

PS the output of the test file:

#lang quadwriter/markdown

 😎

Should look like this:

[image: Screen Shot 2019-05-14 at 9 27 14 PM]

(enlarged for artistic effect)

it worked great, thanks! I'm making an effort of using quad whenever I can 👍