n8willis/opentype-shaping-documents

[Indic] Categorise numbers as "placeholders"

adrianwong opened this issue · 11 comments

Per this section in the spec, we categorise Indic numbers as "other", where "other" is not in any of the syllable identification regex rules, and is thus omitted from the shaping process.

If we're not planning on changing our regex rules to include "other", I think we should re-categorise Indic numbers as "placeholder" instead, which is what HarfBuzz does.
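To make the difference concrete, here is a minimal sketch of how the category choice affects cluster matching. It is not the spec's or HarfBuzz's actual grammar: the one-letter codes, the toy `CLUSTER` pattern, and the `categorise`/`clusters` helpers are all hypothetical, and only a handful of Devanagari ranges are covered.

```python
import re

# Hypothetical one-letter codes for a heavily simplified category set.
CATEGORY_CODES = {
    "consonant": "C",
    "halant": "H",
    "matra": "M",
    "placeholder": "P",
    "other": "O",
}

# Toy cluster pattern (an assumption, not the real regex): a consonant
# conjunct or a placeholder base, followed by an optional matra.
# "O" (other) appears nowhere in it, so "other" characters never match.
CLUSTER = re.compile(r"(?:(?:CH)*C|P)M?")


def categorise(ch, digits_as):
    """Return a simplified category; digits get whatever `digits_as` says."""
    cp = ord(ch)
    if 0x0966 <= cp <= 0x096F:  # Devanagari digits
        return digits_as
    if 0x0915 <= cp <= 0x0939:  # Devanagari consonants KA..HA
        return "consonant"
    if cp == 0x094D:  # virama (halant)
        return "halant"
    if 0x093E <= cp <= 0x094C:  # dependent vowel signs AA..AU
        return "matra"
    return "other"


def clusters(text, digits_as):
    """List the substrings of `text` that the toy pattern accepts."""
    codes = "".join(CATEGORY_CODES[categorise(ch, digits_as)] for ch in text)
    return [text[m.start():m.end()] for m in CLUSTER.finditer(codes)]


# Digit two followed by vowel sign II.
sample = "\u0968\u0940"
as_other = clusters(sample, "other")
as_placeholder = clusters(sample, "placeholder")
```

With digits categorised as "other", the sample yields no clusters at all; with "placeholder", the digit and its vowel sign are picked up as a single cluster.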

The motivation here is that certain fonts (e.g. Noto Sans/Serif Devanagari) do apply the LOCL feature to numbers when a language tag is used (e.g. Marathi), so they shouldn't be omitted from the shaping process.

Having said that, I'm not entirely sure how much of this request would already be invalidated by the note at the bottom of Section 3 (i.e. if localised-form substitution is not technically part of the shaping process, then there is no point requesting that Indic numbers be recategorised in order to "enable their shaping").

Digits often act as the base in an akshara, which is the main reason why InSC = Number characters are part of the USE class BASE. Digits are merely shorthand for writing numeral words anyway, so they may bear a suffix just as a word does, and that suffix can often be a vowel sign or something else. OTL locl doesn’t need to be the major argument here.

> Per this section in the spec, we categorise Indic numbers as "other", where "other" is not in any of the syllable identification regex rules, and is thus omitted from the shaping process.
>
> If we're not planning on changing our regex rules to include "other", I think we should re-categorise Indic numbers as "placeholder" instead, which is what HarfBuzz does.
>
> The motivation here is that certain fonts (e.g. Noto Sans/Serif Devanagari) do apply the LOCL feature to numbers when a language tag is used (e.g. Marathi), so they shouldn't be omitted from the shaping process.

Makes sense. Same with @lianghai's comment (especially regarding shorthand substitutions).

> Having said that, I'm not entirely sure how much of this request would already be invalidated by the note at the bottom of Section 3 (i.e. if localised-form substitution is not technically part of the shaping process, then there is no point requesting that Indic numbers be recategorised in order to "enable their shaping").

I think it's a good change. The note regarding locl was one of those bits that was inserted to smooth over the rough mismatches between "what HarfBuzz considers shaping" and "what the MS docs describe as shaping", so it is perhaps just as much 'verbal sleight of hand' as anything else. It might make sense to note those spots (in comment-elements within the Markdown source or in an issue), because relying on memory isn't robust.

Should be uncontroversial; a change that takes care of this is in #112. I did not see any other instances where a similar change should be made (e.g., in other Brahmi-derived scripts), but if there are any, just say the word.

Looks alright.

But one more clarification: the need to manipulate digit glyphs (among others) under an Indic OTL script tag should be, and should only be, one more reason for segmenting digit characters into Indic OTL script runs. OTL features and lookups should be applied to such glyphs regardless of whether they’re part of an Indic cluster.

Another shaping context is fractions and related numeral-forms (numr, dnom, etc.)

> But one more clarification: the need to manipulate digit glyphs (among others) under an Indic OTL script tag should be, and should only be, one more reason for segmenting digit characters into Indic OTL script runs. OTL features and lookups should be applied to such glyphs regardless of whether they’re part of an Indic cluster.

I'm not positive that I understand what suggestion you're making here. Are you saying that the docs should explicitly discuss how numerals from the script blocks can exist mixed in with all the other text? Or are you saying that the docs should discuss segmenting numerals from the script blocks and numerals from the Latin block? Or, of course, something else.

I certainly gave the numerals terse treatment when writing, but that's mostly because HarfBuzz doesn't do a lot of script-specific stuff with them.

> Another shaping context is fractions and related numeral-forms (numr, dnom, etc.)

Mostly these docs have ignored the "non-script-shaping-specific" features (which I guess is synonymous with OTL features) for simplicity, and because higher-level software might have user controls or markup that affect them. I think that frac & friends fall into the "everything else" bucket that HarfBuzz treats the same regardless of any script-specific shaper work. Do you think it's wrong to overlook that?

No, you're right, and number forms do work out-of-the-box anyway, so my comment doesn't make sense :)

@n8willis, I was saying that whether an OTL locl substitution can be applied to Indic digits shouldn’t have anything to do with the OTL Indic cluster model (“syllable identification regex rules”). E.g., even if we continue to categorize Indic digit characters as “other”, those characters should still be affected by whatever non-Indic OTL features (ccmp, locl, calt, among others) are registered to an Indic OTL script tag.
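A rough sketch of that separation, under stated assumptions: the `Glyph` record, the glyph names, and the `HYPOTHETICAL_LOCL` mapping are invented for illustration and do not come from any real font or from HarfBuzz. The point is only that a run-level substitution never consults the syllable category.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str       # hypothetical glyph name
    category: str   # syllable category, e.g. "consonant" or "other"


# Hypothetical single-substitution lookup standing in for a font's locl
# feature under some language tag; the glyph names are made up.
HYPOTHETICAL_LOCL = {"five-deva": "five-deva.locl"}


def apply_run_feature(run, lookup):
    """Apply a single-substitution lookup to every glyph in the script run.

    The glyph's syllable category is carried along but never checked, so
    "other" glyphs are substituted just like cluster members.
    """
    return [Glyph(lookup.get(g.name, g.name), g.category) for g in run]


run = [Glyph("ka-deva", "consonant"), Glyph("five-deva", "other")]
shaped = apply_run_feature(run, HYPOTHETICAL_LOCL)
```

Here the digit stays categorised as "other" and still receives the substitution, because the feature is applied across the whole script run rather than per cluster.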

> @n8willis, I was saying that whether an OTL locl substitution can be applied to Indic digits shouldn’t have anything to do with the OTL Indic cluster model (“syllable identification regex rules”). E.g., even if we continue to categorize Indic digit characters as “other”, those characters should still be affected by whatever non-Indic OTL features (ccmp, locl, calt, among others) are registered to an Indic OTL script tag.

Thanks @lianghai, that does make sense. IMHO the footnote in Section 3 should adequately cover this concern, although I will defer to your and @n8willis's judgement on that.

Decided to merge this in as-posted, since there seems to be agreement that it would close the issue. @lianghai if you feel that the numerals-in-segmentation issue warrants further attention, definitely feel free to open an issue on it (certainly segmentation itself isn't really discussed in any detail). To me it at least seemed distinct enough from this to not hold it up further.