cmap encoding selection: unicodeEncoding vs. microsoftUCS4Encoding

Question

xianpingge opened this issue 8 years ago · 2 comments

I'm dumping the glyphs from HanaMinB.ttf (available at
https://osdn.net/frs/redir.php?m=pumath&f=%2Fhanazono-font%2F64385%2Fhanazono-20160201.zip),
where most of the characters are above U+FFFF.

Enclosed please find the output of
ttfdump -t cmap HanaMinB.ttf

According to the ttfdump output, this ttf file contains 4 cmap
subtables, covering the 4 encodings defined in truetype.go:

unicodeEncoding = 0x00000003 // PID = 0 (Unicode), PSID = 3 (Unicode 2.0)
microsoftSymbolEncoding = 0x00030000 // PID = 3 (Microsoft), PSID = 0 (Symbol)
microsoftUCS2Encoding = 0x00030001 // PID = 3 (Microsoft), PSID = 1 (UCS-2)
microsoftUCS4Encoding = 0x0003000a // PID = 3 (Microsoft), PSID = 10 (UCS-4)

And the current code selects the first one (unicodeEncoding):

pidPsid := u32(table, offset)
// We prefer the Unicode cmap encoding. Failing to find that, we fall
// back onto the Microsoft cmap encoding.
if pidPsid == unicodeEncoding {
	bestOffset, bestPID, ok = offset, pidPsid>>16, true
	break
} else if pidPsid == microsoftSymbolEncoding ||
	pidPsid == microsoftUCS2Encoding ||
	pidPsid == microsoftUCS4Encoding {
	bestOffset, bestPID, ok = offset, pidPsid>>16, true
	// We don't break out of the for loop, so that Unicode can override Microsoft.
}

As a result, none of the characters above U+FFFF are available.

Should we prefer microsoftUCS4Encoding to the 16-bit-only unicodeEncoding?

HanaMinB.ttf-dump-cmap.txt

Answer 1 · 2016-12-21T22:52:19.000Z

Yeah, we should probably prefer microsoftUCS4Encoding.
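A minimal sketch of what that preference could look like, not the actual truetype.go change. It assumes the standard cmap header layout (a uint16 version, a uint16 numTables, then 8-byte encoding records). The names rank and bestSubtable are hypothetical, the ordering inside rank is only one possible choice, and the table in main is synthetic with made-up offsets.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Encoding keys as quoted in the question: platform ID in the high
// 16 bits, platform-specific ID in the low 16 bits.
const (
	unicodeEncoding         = 0x00000003 // PID = 0 (Unicode), PSID = 3 (Unicode 2.0)
	microsoftSymbolEncoding = 0x00030000 // PID = 3 (Microsoft), PSID = 0 (Symbol)
	microsoftUCS2Encoding   = 0x00030001 // PID = 3 (Microsoft), PSID = 1 (UCS-2)
	microsoftUCS4Encoding   = 0x0003000a // PID = 3 (Microsoft), PSID = 10 (UCS-4)
)

// rank returns a preference score for an encoding key; 0 means the
// encoding is not supported. The ordering is an assumption for this
// sketch: UCS-4 first, because it can reach code points above U+FFFF.
func rank(pidPsid uint32) int {
	switch pidPsid {
	case microsoftUCS4Encoding:
		return 4
	case unicodeEncoding:
		return 3
	case microsoftUCS2Encoding:
		return 2
	case microsoftSymbolEncoding:
		return 1
	}
	return 0
}

// bestSubtable scans every encoding record in a cmap table and returns
// the subtable offset and key of the highest-ranked supported record.
// It never breaks out early, so record order in the font does not matter.
func bestSubtable(cmap []byte) (offset uint32, pidPsid uint32, ok bool) {
	if len(cmap) < 4 {
		return 0, 0, false
	}
	n := int(binary.BigEndian.Uint16(cmap[2:]))
	if len(cmap) < 4+8*n {
		return 0, 0, false
	}
	best := 0
	for i := 0; i < n; i++ {
		rec := cmap[4+8*i:]
		key := binary.BigEndian.Uint32(rec)
		if r := rank(key); r > best {
			best, offset, pidPsid, ok = r, binary.BigEndian.Uint32(rec[4:]), key, true
		}
	}
	return offset, pidPsid, ok
}

func main() {
	// Synthetic cmap header; the subtable offsets are invented.
	cmap := []byte{
		0, 0, 0, 3, // version 0, numTables 3
		0, 0, 0, 3, 0, 0, 0, 0x1c, // PID 0, PSID 3
		0, 3, 0, 1, 0, 0, 0, 0x1c, // PID 3, PSID 1
		0, 3, 0, 10, 0, 0, 1, 0, // PID 3, PSID 10
	}
	offset, key, ok := bestSubtable(cmap)
	fmt.Printf("offset=%d pidPsid=%#x ok=%v\n", offset, key, ok)
}
```

Scoring each record and keeping the maximum avoids the break/override logic in the quoted loop, at the cost of always scanning all records.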

Answer 2 · 2016-12-21T23:31:38.000Z

An alternative is to also accept PID = 0 (Unicode), PSID = 4 (Unicode 2.0, full repertoire, i.e. not restricted to the Basic Multilingual Plane). FWIW, ttx shows me 5 cmap subtables, not 4, for HanaMinB.ttf.

Also, the code as is prefers Unicode to Microsoft cmap encodings, but I can't remember the reason why, and maybe we don't need to. We should probably prefer cmap format 12 tables over cmap format 4, though, for the greater (non-BMP) range.
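The format-based alternative might be sketched as follows. This is an illustration under assumptions, not the library's code: supported and pickByFormat are hypothetical names, the accepted platform/encoding pairs are the four from the question plus PID = 0, PSID = 4, and only the uint16 format field at the start of each subtable is inspected. The table in main is synthetic.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// supported reports whether a (platform ID, platform-specific ID) pair
// is accepted. The set is an assumption: the four encodings quoted in
// the question, plus PID = 0, PSID = 4 (Unicode full repertoire).
func supported(pid, psid uint16) bool {
	switch {
	case pid == 0 && (psid == 3 || psid == 4):
		return true
	case pid == 3 && (psid == 0 || psid == 1 || psid == 10):
		return true
	}
	return false
}

// pickByFormat chooses a subtable by its format rather than by its
// platform: format 12 beats format 4, and other formats are skipped.
// Among subtables of the same format, the first one found is kept.
func pickByFormat(cmap []byte) (offset uint32, format uint16, ok bool) {
	if len(cmap) < 4 {
		return
	}
	n := int(binary.BigEndian.Uint16(cmap[2:]))
	if len(cmap) < 4+8*n {
		return
	}
	for i := 0; i < n; i++ {
		rec := cmap[4+8*i:]
		pid := binary.BigEndian.Uint16(rec)
		psid := binary.BigEndian.Uint16(rec[2:])
		off := binary.BigEndian.Uint32(rec[4:])
		// Skip unsupported encodings and offsets that leave no room
		// for the 2-byte format field.
		if !supported(pid, psid) || uint64(off)+2 > uint64(len(cmap)) {
			continue
		}
		f := binary.BigEndian.Uint16(cmap[off:])
		if f != 4 && f != 12 {
			continue
		}
		if !ok || f > format {
			offset, format, ok = off, f, true
		}
	}
	return
}

func main() {
	// Synthetic table: two encoding records followed by two stub
	// subtables that contain nothing but their format field.
	cmap := []byte{
		0, 0, 0, 2, // version 0, numTables 2
		0, 0, 0, 3, 0, 0, 0, 20, // PID 0, PSID 3 -> subtable at 20
		0, 0, 0, 4, 0, 0, 0, 22, // PID 0, PSID 4 -> subtable at 22
		0, 4, // format 4
		0, 12, // format 12
	}
	offset, format, ok := pickByFormat(cmap)
	fmt.Println(offset, format, ok)
}
```

Keying the choice on the format means a font that stores its format 12 subtable under either platform would be handled the same way, which sidesteps the question of whether Unicode should outrank Microsoft.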