Split text into script runs
brawer opened this issue · 2 comments
Currently, fontdiff asks ICU for the most likely script code given the user-supplied BCP47 language tag, and passes this to HarfBuzz. Instead, fontdiff needs to determine the script runs in the text. For an example of how to do this, see the function _raqm_resolve_scripts().
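To make "script runs" concrete, here is a minimal sketch of the idea in Python. It is not fontdiff's or libraqm's code. It assumes a tiny, hypothetical script table (`_TOY_RANGES`) in place of a real Unicode script lookup such as ICU's, treats every code point outside that table as Common, and skips the paired-bracket handling that a full implementation would need. The names `toy_script` and `split_script_runs` are made up for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical, heavily truncated script table: (first, last, ISO 15924 code).
# A real implementation would query the Unicode Script property instead.
_TOY_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x3041, 0x3096, "Hira"),
    (0x30A1, 0x30FA, "Kana"),
    (0x4E00, 0x9FFF, "Hani"),
    (0xAC00, 0xD7A3, "Hang"),
]

# Scripts that carry no identity of their own (Common, Inherited).
_NEUTRAL = {"Zyyy", "Zinh"}


def toy_script(ch: str) -> str:
    """Return a script code for ch; anything unknown counts as Common (assumption)."""
    cp = ord(ch)
    for first, last, code in _TOY_RANGES:
        if first <= cp <= last:
            return code
    return "Zyyy"


def split_script_runs(text: str) -> List[Tuple[int, int, str]]:
    """Split text into (start, end, script) runs, with end exclusive."""
    scripts = [toy_script(ch) for ch in text]

    # Forward pass: neutral characters join the script of the preceding character.
    last: Optional[str] = None
    for i, s in enumerate(scripts):
        if s in _NEUTRAL:
            if last is not None:
                scripts[i] = last
        else:
            last = s

    # Leading neutrals have no predecessor; give them the first real script.
    first_real = next((s for s in scripts if s not in _NEUTRAL), None)
    if first_real is not None:
        for i, s in enumerate(scripts):
            if s not in _NEUTRAL:
                break
            scripts[i] = first_real

    # Group adjacent characters that resolved to the same script.
    runs: List[Tuple[int, int, str]] = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or scripts[i] != scripts[start]:
            runs.append((start, i, scripts[start]))
            start = i
    return runs
```

Each run would then be shaped separately, with the run's script passed to HarfBuzz instead of a single script guessed from the language tag. The language tag still matters for picking regional letterforms within a script, which is what the snippet below exercises.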
To reproduce, render this snippet with NotoSansCJK.ttc:
<html>
<div>
<span lang="ja">今</span>
<span lang="zh-Hani">今</span>
<span lang="zh-Hans">今</span>
<span lang="zh-Hant">今</span>
<span lang="ko">今</span>
</div>
</html>
Currently, it looks like this:
The first two letterforms are correct. However, all Chinese letterforms should have the shape of the second glyph, i.e. with a diagonal instead of horizontal stroke in the middle.
fontdiff now produces the same shapes with NotoSansCJK as prescribed by the Unicode standard. For example, U+4ECA 今 has a diagonal stroke for language zh-Hans, and a horizontal stroke otherwise. I didn’t test all glyphs, just a few. Note that Wikipedia is giving different shapes than Unicode, and Unicode is obviously more trustworthy.
Legend: G: General Chinese · H: Chinese in Hong Kong · T: Chinese in Taiwan · J: Japanese · K: Korean · V: Vietnamese.
There are more sources than these. There are also Unicode Variations, which I did not test for this bug because they would have to be handled by the font; a rendering engine like fontdiff doesn’t need to do anything special about them.
> Unicode is obviously more trustworthy.
Unicode can be pretty idiosyncratic sometimes. Of course, I have no idea about CJK.