googlefonts/fontdiff

Split text into script runs

brawer opened this issue · 2 comments

Currently, fontdiff asks ICU for the most likely script code given the user-supplied BCP47 language tag, and passes this to HarfBuzz. Instead, fontdiff needs to determine script runs. For an example of how to do this, see the function _raqm_resolve_scripts() in libraqm.
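To illustrate what "determining script runs" means, here is a minimal Python sketch. It is not fontdiff's or libraqm's code. The names `script_of` and `script_runs` are hypothetical, and the tiny range table stands in for a real Unicode script lookup (such as the one ICU provides). It treats every combining mark as Inherited, which is only an approximation, and it skips the paired-bracket handling that a full resolver would need.

```python
import unicodedata

# Toy script table (assumption): a real implementation would query the
# Unicode Script property instead of hard-coding a few ranges.
_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x3041, 0x3096, "Hira"),  # Hiragana letters
    (0x30A1, 0x30FA, "Kana"),  # Katakana letters
    (0x4E00, 0x9FFF, "Hani"),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3, "Hang"),  # Hangul syllables
]

NEUTRAL = ("Zyyy", "Zinh")  # ISO 15924 codes for Common and Inherited


def script_of(ch):
    """Return a script code for one character (toy approximation)."""
    if unicodedata.category(ch).startswith("M"):
        return "Zinh"  # simplification: all combining marks are Inherited
    cp = ord(ch)
    for first, last, script in _RANGES:
        if first <= cp <= last:
            return script
    return "Zyyy"  # everything not in the table counts as Common


def script_runs(text):
    """Split text into (substring, script) runs."""
    scripts = [script_of(ch) for ch in text]

    # Forward pass: neutral characters take the script of the preceding one.
    last = None
    for i, s in enumerate(scripts):
        if s in NEUTRAL:
            if last is not None:
                scripts[i] = last
        else:
            last = s

    # Backward pass: leading neutral characters take the script that follows.
    following = None
    for i in range(len(scripts) - 1, -1, -1):
        if scripts[i] in NEUTRAL:
            if following is not None:
                scripts[i] = following
        else:
            following = scripts[i]

    # Group consecutive characters that share a resolved script.
    runs = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or scripts[i] != scripts[start]:
            runs.append((text[start:i], scripts[start]))
            start = i
    return runs
```

With runs like these, each run could be shaped separately and its script passed to HarfBuzz (for example via `hb_buffer_set_script`), instead of one script guessed from the language tag for the whole text.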

To reproduce, render this snippet with NotoSansCJK.ttc:

<html>
  <div>
    <span lang="ja"></span>
    <span lang="zh-Hani"></span>
    <span lang="zh-Hans"></span>
    <span lang="zh-Hant"></span>
    <span lang="ko"></span>
  </div>
</html>

Currently, it looks like this:

The first two letterforms are correct. However, all Chinese letterforms should have the shape of the second glyph, i.e. with a diagonal instead of horizontal stroke in the middle.

With NotoSansCJK, fontdiff now produces the same shapes as prescribed by the Unicode standard. For example, U+4ECA 今 has a diagonal stroke for language zh-Hans, and a horizontal stroke otherwise. I didn’t test all glyphs, only a few. Note that Wikipedia shows different shapes than Unicode, and Unicode is obviously more trustworthy.

[Image: unicode-4eca]

G: General Chinese · H: Chinese in Hong Kong · T: Chinese in Taiwan · J: Japanese · K: Korean · V: Vietnamese. There are more sources than these. There are also Unicode variation sequences, which I did not test for this bug: they have to be handled by the font, so a rendering engine like fontdiff doesn’t need to do anything special about them.

> Unicode is obviously more trustworthy.

Unicode can be pretty idiosyncratic sometimes. Of course, I have no idea about CJK.