Preserving clusters with Common script bases in script segmentation
NorbertLindenberg opened this issue · 2 comments
The algorithms documented in Script Segmentation have one problem in common: they do not guarantee that clusters (in the OpenType sense) whose base has the script property Common are kept together. The most important such bases are U+00A0 NO-BREAK SPACE, which The Unicode Standard (page 60) recommends as the base for showing nonspacing marks in isolation, and U+25CC DOTTED CIRCLE, which is commonly used in code charts or on keyboards for the same purpose.
The Script Segmentation document discusses this issue, but concludes:
The fact that the run breaking algorithm may miscategorise the script of a common character is not a problem unless that character undergoes specific script only styling. If the C characters here should be rendered/shaped differently according to whether they resolve to script A or B, then their correct categorisation becomes important.
I believe this underestimates the problem. The failure to keep such clusters together leads to the following issues:
- OpenType shaping engines or AAT fonts may insert additional dotted circles before combining marks that reach them without the base character that the author has already provided. For example, if I write ◌ៀ (that’s U+25CC DOTTED CIRCLE and U+17C0 KHMER VOWEL SIGN IE), you’ll very likely see two dotted circles where the document contains only one. However, this behavior depends on both the textual context and the specific implementation of the OpenType shaping engine or AAT font, so it can’t be relied on if you do want a dotted circle either. This issue has been extensively documented by Richard Ishida (“The Combining Character Conundrum”) and by Marc Durdin.
- If a cluster has more than one combining mark, OpenType shaping engines or AAT fonts may break it up even further by inserting a dotted circle before every single mark. This is especially toxic for some Brahmic scripts, where a syllable may include virama characters that are not known outside of Unicode but are used to encode subjoined consonants. For example, if I try to show the subjoined Myanmar consonant wa using ◌္ဝ (that’s U+25CC DOTTED CIRCLE, U+1039 MYANMAR SIGN VIRAMA, U+101D MYANMAR LETTER WA), you’ll very likely see a character ္ that doesn’t occur outside of Unicode, and the consonant ဝ to the right of the base rather than below it.
- If a font is designed to properly position combining marks relative to the no-break space or dotted circle, or to use wider variants of them to accommodate wider marks, or to raise the base line of all base glyphs to increase the space available for below-base marks, this doesn’t work if the cluster is broken up. (Such features would be the “rendered/shaped differently according to whether they resolve to script A or B” mentioned in the document, and I’ve seen all of them applied in actual fonts.)
To solve this problem, algorithms that segment text into script runs should check whether each Common script base character is followed by combining marks and, if so, give any script that can be determined from those marks (other than Common or Inherited) preference over any script determined from preceding characters.
The set of Common script characters that should be considered bases for this purpose needs to be determined. A candidate set would be those Common script characters that the Universal Shaping Engine classifies as BASE_OTHER.
A full solution to the issues described above requires similar care in breaking text into font runs (e.g., when using fallback fonts), but let's start here.
Might it be argued that there should be no script change before a character with general category M (that is any M: Mn, Mc, etc.)?
@mhosken I'm not sure what exactly you’re proposing. If it’s a clarification for the phrase “combining mark” in my proposed solution, then yes.
More precisely, for a string
AAACC◌MMM
where A are characters with script 𝑨, C are characters with script Common, ◌ is dotted circle or another Common script base, M are characters with general category M[cen] and script 𝑴, 𝑴≠𝑨, then the script boundary should be placed at
AAACC|◌MMM
instead of
AAACC◌|MMM
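To make the rule concrete, here is a minimal Python sketch of how I read it; it is not taken from any existing segmentation code. Assumptions: the script table `_SCRIPT_RANGES` is a hypothetical, deliberately tiny stand-in for the real Unicode Script property (which Python's standard library doesn't expose), `COMMON_BASES` contains only the two bases named above rather than the still-to-be-determined full set, and all function names are made up for illustration. Common characters other than those bases simply take the script of the preceding run.

```python
import unicodedata

# Hypothetical, tiny script table for illustration only; a real implementation
# would look up the Unicode Script property (Scripts.txt).
_SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0300, 0x036F, "Zinh"),  # combining diacritical marks are Inherited
    (0x1000, 0x109F, "Mymr"),
    (0x1780, 0x17FF, "Khmr"),
]
NEUTRAL = ("Zyyy", "Zinh")  # Common and Inherited

# Assumed base set: just the two bases discussed in this issue.
COMMON_BASES = {"\u00A0", "\u25CC"}


def script_of(ch):
    """Script code of ch according to the toy table; Common if unlisted."""
    cp = ord(ch)
    for lo, hi, script in _SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return "Zyyy"


def is_mark(ch):
    """True for general category Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def script_from_marks(text, i):
    """Script of the first mark after text[i] that is neither Common nor
    Inherited, or None if the following marks determine no script."""
    j = i + 1
    while j < len(text) and is_mark(text[j]):
        script = script_of(text[j])
        if script not in NEUTRAL:
            return script
        j += 1
    return None


def resolve_scripts(text):
    """Resolved script for each character of text."""
    resolved = []
    prev = "Zyyy"  # last script that was neither Common nor Inherited
    for i, ch in enumerate(text):
        script = script_of(ch)
        if is_mark(ch) and resolved:
            # No script change before a mark: it stays with its base.
            script = resolved[-1]
        elif ch in COMMON_BASES:
            # Look ahead: following marks win over preceding characters.
            script = script_from_marks(text, i) or prev
        elif script in NEUTRAL:
            script = prev
        resolved.append(script)
        if script not in NEUTRAL:
            prev = script
    return resolved


def script_runs(text):
    """Group text into (script, substring) runs."""
    runs = []
    for ch, script in zip(text, resolve_scripts(text)):
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs
```

With this sketch, `script_runs("abc, \u25CC\u17C0")` puts the dotted circle in the Khmer run together with its vowel sign, which corresponds to the AAACC|◌MMM boundary, and `script_runs("\u25CC\u1039\u101D")` yields a single Myanmar run. A dotted circle that is not followed by any mark stays with the preceding run.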