n8willis/opentype-shaping-documents

Add errata document

Opened this issue · 22 comments

We ought to document all the shaping-related errata we (reasonably) can as well as describing "how it should work". Let's gather those here for the moment, with an eye towards calling the initial set complete in a couple of weeks.

So far:

  • The GSUB spec says that MultipleSubst cannot be used to delete a glyph (it must always substitute at least one replacement glyph), but some implementations allow the replacement glyph array to be zero-length.

  • We have several Uniscribe-specific compatibility bugs listed in
    opentype-shaping-documents/notes/uniscribe-bug-compatibility.md

  • The spec is ambiguous about which adjacent-mark sequences need reordering, as per #34 (comment)
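
To make the MultipleSubst item concrete, here is a minimal Python sketch of the two behaviours. The function name `apply_multiple_subst`, the `allow_empty` switch, and the glyph IDs are hypothetical and chosen only for illustration; this is not code from any particular shaper, and the "strict" branch (leaving the glyph untouched) is just one possible reading of the spec.

```python
def apply_multiple_subst(glyphs, pos, sequences, allow_empty=True):
    """Apply a toy MultipleSubst at glyphs[pos]; return the next position.

    `sequences` maps a covered glyph ID to its replacement sequence.
    With allow_empty=True a zero-length sequence deletes the glyph;
    with allow_empty=False (assumed strict reading) it is ignored.
    """
    seq = sequences.get(glyphs[pos])
    if seq is None:
        return pos + 1          # glyph not covered by this lookup
    if not seq and not allow_empty:
        return pos + 1          # strict mode: refuse to delete
    glyphs[pos:pos + 1] = seq   # replace one glyph with zero or more
    return pos + len(seq)


lenient = [10, 20, 30]
apply_multiple_subst(lenient, 1, {20: []})                     # lenient is now [10, 30]
strict = [10, 20, 30]
apply_multiple_subst(strict, 1, {20: []}, allow_empty=False)   # strict is unchanged
```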

The scriptListOffset, featureListOffset, and lookupListOffset fields in the GSUB/GPOS header may be NULL, despite the spec only suggesting that featureVariationsOffset may be NULL.

I personally prefer the direction that any offset may be NULL...

A NULL offset is at least a clear indication of a missing value, although it makes more sense for values which are explicitly optional.

The weird situation we've encountered with some fonts is very small but non-zero offsets, like "4", which aren't big enough to point outside of the current struct and clearly don't point to a valid value if you try to follow them. I'm not sure how font creation tools manage to make such mistakes.

I can imagine an offset value of 4 pointing to two 0 bytes to be a valid encoding of an empty array...

And offset to empty array can be encoded as NULL. No need to have wording to allow it.

Yes if it points to a valid value that's fine, although a little weird, but the reason we noticed in the first place is because it wasn't valid 😆

Something I wonder is whether an offset to a zero-sized object is allowed to point outside the file 🤔

We don't allow that. I mean. We go ahead and "fix" it by rewriting the offset with NULL, so doesn't make a difference.

And by we, I meant in HarfBuzz.
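
For the errata write-up, the offset handling discussed above could be sketched roughly as follows in Python. The function name `read_layout_header` is hypothetical. Treating an out-of-range offset as NULL mirrors what is described for HarfBuzz above; treating a small non-zero offset that lands inside the header (such as "4") as missing is only an assumption for this sketch, since, as noted above, such an offset could in principle point at a valid empty array.

```python
import struct


def read_layout_header(table: bytes) -> dict:
    """Parse a GSUB/GPOS header, mapping unusable offsets to None."""
    if len(table) < 10:
        raise ValueError("table too short for a GSUB/GPOS header")
    major, minor, script, feature, lookup = struct.unpack_from(">HHHHH", table, 0)
    offsets = {"scriptList": script, "featureList": feature, "lookupList": lookup}
    header_size = 10
    if (major, minor) >= (1, 1):
        # Version 1.1 appends a 32-bit featureVariationsOffset.
        if len(table) < 14:
            raise ValueError("table too short for a version 1.1 header")
        (offsets["featureVariations"],) = struct.unpack_from(">L", table, 10)
        header_size = 14
    cleaned = {}
    for name, off in offsets.items():
        if off == 0:
            cleaned[name] = None    # NULL: explicitly missing
        elif off < header_size or off >= len(table):
            cleaned[name] = None    # assumed policy: neuter bogus offsets
        else:
            cleaned[name] = off
    return cleaned


# A version 1.0 header with a NULL script list, a bogus feature list
# offset of 4, and a lookup list offset that points just past the header.
sample = struct.pack(">HHHHH", 1, 0, 0, 4, 10) + b"\x00\x00"
header = read_layout_header(sample)
```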

Another thing that would be good to clarify is the way nested contextual lookups use their own lookup flag, but other lookups within a contextual lookup use the parent's lookup flag. I don't recall if the spec has anything to say about nested contextual lookups; it's also helpful to know whether the child's context can extend beyond the parent's, let alone the weirdness of using a child MultipleSubst to delete the parent's context!

Also GSUB lookups must be sorted by lookup index before being applied, but as I recall GPOS lookups must not?

Another thing that would be good to clarify is the way nested contextual lookups use their own lookup flag, but other lookups within a contextual lookup use the parent's lookup flag.

I think in harfbuzz we always use the child's flags. Do you have a test case that can reveal this?

I don't recall if the spec has anything to say about nested contextual lookups; it's also helpful to know whether the child's context can extend beyond the parent's,

@litherum reports that Windows does not allow that, while HarfBuzz and CoreText do.
https://twitter.com/Litherum/status/1103911322872307715

let alone the weirdness of using a child MultipleSubst to delete the parent's context!

Deleting parent's context is no different from ligating parent's context, which is one of the examples in AFDKO feature file format (matching "ffi" then a child ligating f+f, then other child ligating ff+i).

Also GSUB lookups must be sorted by lookup index before being applied, but as I recall GPOS lookups must not?

GPOS is mostly additive. I don't know what Windows does. But HarfBuzz sorts them. The spec clearly says lookups are applied in their numeric order. Of course there's the per-script lists...

I think in harfbuzz we always use the child's flags. Do you have a test case that can reveal this?

A quick test reveals that the Amiri fonts break if we use the child's lookup flag.

GPOS is mostly additive. I don't know what Windows does. But HarfBuzz sorts them. The spec clearly says lookups are applied in their numeric order. Of course there's the per-script lists...

Hmm I just tried sorting vs. not-sorting them and got the same results each time; I got different results with our old implementation, but that must have been due to a bug. 🤓
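
As a reference point for the ordering question, the "apply in numeric order" reading can be sketched as below. This is a hedged Python illustration with a hypothetical function name and made-up lookup indices; it only shows gathering the lookups referenced by the active features, dropping duplicates, and sorting by index. Whether every implementation does this for GPOS is not settled in this thread.

```python
def lookups_to_apply(feature_lookup_lists):
    """Merge the lookup index lists of all active features.

    Each lookup is applied once, in ascending lookup-index order,
    regardless of which feature (or how many features) referenced it.
    """
    return sorted({index for lookups in feature_lookup_lists for index in lookups})


# Two hypothetical features referencing overlapping lookups.
order = lookups_to_apply([[5, 2], [2, 9, 0]])
```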

I think in harfbuzz we always use the child's flags. Do you have a test case that can reveal this?

A quick test reveals that the Amiri fonts break if we use the child's lookup flag.

(I really hope I've tested this correctly...)

Test font: Amiri v. 000.109
Test sequence: U+0646 (Letter Noon), U+0652 (Sukun), U+0628 (Letter Beh)

Lookup i: 139 (chaining contextual, lookup flag: 8 (ignore marks)) specifies a child lookup i: 109 (single, lookup flag: 0). Using the child's lookup flag appears to inhibit the substitution of glyph 2341 -> 3219, resulting in an output that looks like this:

[Screenshot: Screen Shot 2019-03-26 at 10 53 39 am]

as opposed to this, which uses the parent's lookup flag (this is how it looks with HarfBuzz/CoreText):

[Screenshot: Screen Shot 2019-03-26 at 10 55 06 am]

That doesn't make sense. Why would an IgnoreMarks lookup flag inhibit a single substitution? I checked the HarfBuzz code again; we definitely use the child lookup flag.

I don’t get this either. The lookups in question are basically:

lookup BaaNonIsol {
  sub @aBaa.init by @aBaa.init_BaaNonIsol;
  sub @aNon.fina by @aNon.fina_BaaNonIsol;
} BaaNonIsol;

lookup BaaNonIsolCalt {
  lookupflag IgnoreMarks;
  sub [@aBaa.init]' lookup BaaNonIsol
      [@aNon.fina]' lookup BaaNonIsol;
} BaaNonIsolCalt;

The contextual substitution lookup has the IgnoreMarks flag, as it should, so that the “U+0646 U+0628” sequence matches regardless of any intervening marks. The single substitution lookup does not have the IgnoreMarks flag because it wouldn’t make any difference: it applies to a single input glyph, so there are no marks in its input to ignore or not.
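
The division of labour described above can be modelled with a small Python sketch. Everything here is a toy under stated assumptions: glyphs are plain strings, the glyph names and the `MARKS` set are hypothetical, and the functions are not taken from HarfBuzz or any other shaper. The parent's flag decides which glyphs are skipped while matching the context; the child lookup is then applied at each matched position using its own flag.

```python
IGNORE_MARKS = 0x0008
MARKS = frozenset({"sukun"})  # hypothetical mark glyphs for this toy model


def skipped(glyph, flag):
    """True if `glyph` is invisible to a lookup with this lookup flag."""
    return bool(flag & IGNORE_MARKS) and glyph in MARKS


def match_context(buffer, start, classes, flag):
    """Return the buffer positions matching `classes` from `start`, or None."""
    positions, i = [], start
    for cls in classes:
        while i < len(buffer) and skipped(buffer[i], flag):
            i += 1                      # parent's flag skips marks
        if i >= len(buffer) or buffer[i] not in cls:
            return None
        positions.append(i)
        i += 1
    return positions


def apply_single(buffer, pos, mapping, flag):
    """Toy single substitution at one position, using the child's flag."""
    glyph = buffer[pos]
    if not skipped(glyph, flag) and glyph in mapping:
        buffer[pos] = mapping[glyph]


def apply_contextual(buffer, classes, parent_flag, mapping, child_flag):
    """Match with the parent's flag, substitute with the child's flag."""
    i = 0
    while i < len(buffer):
        positions = None
        if not skipped(buffer[i], parent_flag):
            positions = match_context(buffer, i, classes, parent_flag)
        if positions is None:
            i += 1
            continue
        for pos in positions:
            apply_single(buffer, pos, mapping, child_flag)
        i = positions[-1] + 1
    return buffer


mapping = {"beh.init": "beh.init_BaaNonIsol", "noon.fina": "noon.fina_BaaNonIsol"}
classes = [{"beh.init"}, {"noon.fina"}]
with_flag = apply_contextual(["beh.init", "sukun", "noon.fina"],
                             classes, IGNORE_MARKS, mapping, 0)
without_flag = apply_contextual(["beh.init", "sukun", "noon.fina"],
                                classes, 0, mapping, 0)
```

In this model the child's flag of 0 never blocks the substitution, because the positions handed to it were already chosen by the parent's match; dropping IgnoreMarks from the parent is what prevents the match when a mark intervenes.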

BTW, your input would give the output you show only if the text was processed LTR. I'm not sure if this was intentional, but I’d make sure Arabic text is tested in the RTL direction, as LTR doesn’t always give the expected output (and might give different results in different implementations).

Sorry! This was unintentional on my part. We do test Arabic text in RTL, but when I was writing up my findings I somehow got it in my head that I needed to specify the input in reverse 🤦‍♂️.

Thank you for your responses! Makes sense.

Another thing that would be good to clarify is the way nested contextual lookups use their own lookup flag, but other lookups within a contextual lookup use the parent's lookup flag.

This was some confusion on my part, I was mixing up the use of the parent's lookup flag and the child's in a way that happened to make the tests pass so I never realised. 😆

We've simplified the code now and it makes much more sense, thanks for your patience.

GPOS is mostly additive. I don't know what Windows does. But HarfBuzz sorts them. The spec clearly says lookups are applied in their numeric order. Of course there's the per-script lists...

So, is there an ambiguity regarding the per-script lists?

Following up on this, my guess would be that this means it's ambiguous how to sort the GPOS lookups that are script-tagged with the GPOS lookups that are generic/default(dflt?). Is that the concern?

If there's something here, I'll add it to errata.

  • Noting the nested-contextual-lookups issue detailed in allsorts #25.