n8willis/opentype-shaping-documents

[Indic] Dotted circle placement in broken syllables

adrianwong opened this issue · 9 comments

On encountering a broken syllable, HarfBuzz inserts a dotted circle at the start of the syllable (or after a "Repha") and shapes it as if it were a standalone syllable.

However, on analysing the regex, we can see that a standalone syllable's dotted circle comes after a possible "Reph":

reph = (Ra H | Repha);

standalone_cluster = ((Repha|CS)? PLACEHOLDER | reph? DOTTEDCIRCLE).n? complex_syllable_tail;
broken_cluster     =                            reph?               n? complex_syllable_tail;
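
A minimal sketch of the difference, assuming category tokens that are plain illustrative strings (not HarfBuzz identifiers) and looking only at the `reph? DOTTEDCIRCLE` head of the standalone grammar. The helper names are hypothetical. It shows that inserting the circle at the start leaves "Ra, H" in the tail, while inserting after the reph keeps it in reph position:

```python
# Illustrative category tokens; these are labels for this sketch only.
RA, H, REPHA, DC = "Ra", "H", "Repha", "DOTTEDCIRCLE"

def reph_length(cats):
    """Length of a leading reph = (Ra H | Repha), or 0 if there is none."""
    if cats[:2] == [RA, H]:
        return 2
    if cats[:1] == [REPHA]:
        return 1
    return 0

def insert_at_start(cats):
    """Insertion as described above: start of syllable, or after a Repha."""
    pos = 1 if cats[:1] == [REPHA] else 0
    return cats[:pos] + [DC] + cats[pos:]

def insert_after_reph(cats):
    """Proposed insertion: after any leading reph (Ra H or Repha)."""
    pos = reph_length(cats)
    return cats[:pos] + [DC] + cats[pos:]

def split_standalone_head(cats):
    """Match `reph? DOTTEDCIRCLE` and return (reph, tail), or None."""
    pos = reph_length(cats)
    if cats[pos:pos + 1] != [DC]:
        return None
    return cats[:pos], cats[pos + 1:]

broken = [RA, H, "M"]  # e.g. "Ra, Halant, Sign E"
print(split_standalone_head(insert_at_start(broken)))
print(split_standalone_head(insert_after_reph(broken)))
```

In the first case the reph part is empty and "Ra, H" is left for the syllable tail to absorb; in the second it is recognised as the reph of a standalone cluster.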

Should the dotted circle always be inserted after a possible "Reph" such that it is by definition a standalone syllable? ***

Also, when shaping syllables that begin with a possible "Reph", inserting a circle before the sequence yields output that looks a bit peculiar (to my eyes at least). Here are two examples:

"Ra, Halant, Halant, Ka" (Lohit Bengali)

Would the dotted circle in HarfBuzz's current output:

[Image: HarfBuzz's current output]

convey the missing base consonant better if it looked like this instead? (This was achieved by inserting the dotted circle after the "Ra, Halant".)

[Image: output with the dotted circle inserted after "Ra, Halant"]

"Ra, Halant, Sign E" (Lohit Bengali)

Here, HarfBuzz marks the "Ra, Halant" sequence as post-base, resulting in the "Ra" taking on subjoined form. While this only happens with Indic1 fonts, it might be misleading:

[Image: HarfBuzz's current output]

If the dotted circle is inserted after the "Ra, Halant":

[Image: output with the dotted circle inserted after "Ra, Halant"]

*** Where a change like this would fall short is that it doesn't handle:

  1. REPH_MODE_EXPLICIT scripts. Inserting a dotted circle in between a "Halant" and "ZWJ" would inhibit the formation of "Reph".
  2. Syllable-initial "Ra, Halant, ZWJ" sequences in REPH_MODE_IMPLICIT scripts. A "ZWJ" in this context prohibits "Reph" formation, but by inserting the dotted circle in between the "Halant" and "ZWJ", we are permitting it.

Both cases can be resolved by inserting the dotted circle after the "ZWJ", but that would mean once again deviating from the definition of a standalone syllable. Is the standalone syllable regex overly restrictive perhaps?
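
The "insert after the ZWJ" option could be sketched as follows. This is only an illustration under stated assumptions: the category tokens are plain strings, `insertion_index` is a hypothetical helper, and the result intentionally deviates from the standalone grammar in the way described above.

```python
# Illustrative category tokens; these are labels for this sketch only.
RA, H, REPHA, ZWJ, DC = "Ra", "H", "Repha", "ZWJ", "DOTTEDCIRCLE"

def insertion_index(cats):
    """Index at which a dotted circle would go in a broken cluster.

    Skips a leading "Ra, H, ZWJ" as a unit so the circle never lands
    between the Halant and the ZWJ; otherwise skips a leading reph.
    """
    if cats[:3] == [RA, H, ZWJ]:
        return 3
    if cats[:2] == [RA, H]:
        return 2
    if cats[:1] == [REPHA]:
        return 1
    return 0

def insert_dotted_circle(cats):
    """Return a copy of cats with the dotted circle inserted."""
    pos = insertion_index(cats)
    return cats[:pos] + [DC] + cats[pos:]

print(insert_dotted_circle([RA, H, ZWJ, "M"]))
print(insert_dotted_circle([RA, H, "M"]))
print(insert_dotted_circle(["M"]))
```

Because "Halant, ZWJ" stays adjacent, whatever effect the ZWJ has on "Reph" formation in a given script (requiring it or prohibiting it) is left untouched by the insertion.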

Theoretically, a dotted circle glyph should be inserted (wherever the base glyph is expected to be; after GSUB so it does not mess with real characters, but before GPOS so mark positioning is active) for every dependent sign that is formed on its own (without an encoded base)—that is:

  • atomically encoded (thus naturally definitively encoded and definitively formed on its own) dependent signs (eg, U+09C7 ে BENGALI VOWEL SIGN E),
  • and non-atomically but definitively encoded signs (eg, Sinhala <U+0DBB ර SINHALA LETTER RAYANNA, U+0DCA ් SINHALA SIGN AL-LAKUNA, ZWJ> and <U+1039 ္ MYANMAR SIGN VIRAMA, U+1000 က MYANMAR LETTER KA>, but not <U+09B0 র BENGALI LETTER RA, U+09CD ্ BENGALI SIGN VIRAMA>). (Note that the explicitly encoded half forms of Devanagari, etc. might be a necessary opted-out special case.)

None of the four figures looks right.

  1. REPH_MODE_EXPLICIT scripts. Inserting a dotted circle in between a "Halant" and "ZWJ" would inhibit the formation of "Reph".

The dotted circle should be inserted only after a dependent sign (here a repha) is formed on its own (without an encoded base).

  2. Syllable-initial "Ra, Halant, ZWJ" sequences in REPH_MODE_IMPLICIT scripts. A "ZWJ" in this context prohibits "Reph" formation, but by inserting the dotted circle in between the "Halant" and "ZWJ", we are permitting it.

Both cases can be resolved by inserting the dotted circle after the "ZWJ", but that would mean once again deviating from the definition of a standalone syllable. Is the standalone syllable regex overly restrictive perhaps?

The standalone syllable regex seems to have overlooked ZWJ’s participation in repha formation. Generally speaking, ZWJ’s effect should always be analyzed alongside a neighbor virama.


I’m aware that my recommendation is different from HarfBuzz’s current practice for sequences like “ে্”. But I prefer the USE spec’s more predictable behavior of inserted dotted circles.

Thanks heaps for your feedback, @lianghai!

The example sequences are really valuable in allowing me to follow your reasoning.

I agree HB's insertion logic is inadequate. Should jump over possible reph as well. And yes, the grammar should also allow ZWJ for explicit reph... We cannot modify the Indic grammar based on the script though.

That said, it feels to me like Ra,Halant by itself is a complete syllable. No?

Would be nice if we can agree on something and adjust HB as well.

cc @dscorbett

Thanks for weighing in, @behdad.

That said, it feels to me like Ra,Halant by itself is a complete syllable. No?

I would say so too. From observation, Uniscribe does seem to treat them as such. Using the same two examples in the original post above:

[Image: uniscribe-1]
[Image: uniscribe-2]

(Judging by these examples and some others, I assume that Uniscribe's current dotted circle insertion strategy is that which is described in the USE spec.)

<ra, virama> is a perfectly valid akshara on its own (and is used in real text). It should not be interfered with by an inserted dotted circle that can make it join the following characters' cluster. (Note there is a reason why a dotted circle character is not in the string in the first place: a base for repha is not intended to be there.)

A note that what I said earlier—

… a dotted circle glyph should be inserted (… after GSUB … but before GPOS …) …

—is likely wrong. See MicrosoftDocs/typography-issues#281 for a discussion on this matter.

When a cluster starts with any character that has UGC=Mc or UGC=Mn, USE inserts a dotted circle glyph (U+25CC) to indicate a broken cluster. Defective clusters do not form extended clusters themselves. A sequence of marks without a valid base forms separate clusters for each mark. Note that an explicit character U+25CC is a valid generic base (GB, BASE_OTHER) and so can form extended clusters.
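
A rough sketch of that rule, assuming the input is a single run in which any base-less marks occur only at the start. Real USE clustering is considerably more involved, and `insert_dotted_circles` is a hypothetical helper, not part of any shaping engine:

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"

def insert_dotted_circles(text):
    """Give each leading Mn/Mc character its own U+25CC base.

    Simplifying assumption: once any non-mark character has been seen,
    every later mark is treated as attached to a valid base.
    """
    out = []
    has_base = False
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Mc"):
            if not has_base:
                # Defective cluster: one inserted circle per mark.
                out.append(DOTTED_CIRCLE)
        else:
            # Any non-mark, including an explicit U+25CC, acts as a base.
            has_base = True
        out.append(ch)
    return "".join(out)

# Two base-less Bengali vowel signs (E, AA) each get a circle.
print([hex(ord(c)) for c in insert_dotted_circles("\u09c7\u09be")])
```

An explicit U+25CC in the input is left alone, since it is not a mark and so counts as a base for whatever follows it.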

If one were to implement dotted circle insertion per USE's recommendations, before reordering but after normalisation, would double dotted-circles on decomposed matras be considered acceptable?

Take, for example, <U+09CB ো BENGALI VOWEL SIGN O>, which decomposes into <U+09C7 ে BENGALI VOWEL SIGN E, U+09BE া BENGALI VOWEL SIGN AA>. If I take the spec's recommendation to "form separate clusters for each mark", this would mean I'd get <Sign E, Dotted Circle> and <Dotted Circle, Sign Aa> post-reorder.
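
The scenario can be reproduced with a small sketch. It assumes canonical decomposition (NFD) stands in for the shaper's normalisation step, that each base-less mark gets its own inserted circle, and that U+09C7 is the only pre-base matra that needs reordering here. `broken_mark_clusters` is a hypothetical helper:

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"
# Assumption for this sketch: only Sign E is treated as a pre-base matra.
PRE_BASE_MATRAS = {"\u09c7"}

def broken_mark_clusters(text):
    """Decompose, then build one <circle, mark> cluster per base-less mark."""
    decomposed = unicodedata.normalize("NFD", text)
    clusters = []
    for ch in decomposed:
        if unicodedata.category(ch) not in ("Mn", "Mc"):
            raise ValueError("this sketch only handles base-less marks")
        cluster = [DOTTED_CIRCLE, ch]
        if ch in PRE_BASE_MATRAS:
            # Reordering moves the pre-base matra before its base.
            cluster.reverse()
        clusters.append(cluster)
    return clusters

for cluster in broken_mark_clusters("\u09cb"):
    print([f"U+{ord(c):04X}" for c in cluster])
```

For U+09CB this yields the two clusters described above, <Sign E, Dotted Circle> and <Dotted Circle, Sign Aa>, so two circles appear where a reader might expect one.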

I think it's going to be necessary to insert an explicit discussion of the U+25CC issue. If I can finish up merging some of the remaining WIP changes I'll do that and it will be easier to judge the wording in context.

There is some in-progress work to sort out this issue in PR #121 for those who want to take a look. Very much expect it to change; at present it's only a framework, but it does attempt to call attention to some of the issues mentioned in this thread. Mostly, I just want to know if anybody thinks it is the wrong place to start mentioning the dotted-circle insertion process; script-specific stuff is still to come.