sile-typesetter/sile

MathML and Unicode invisible operators

Unicode defines a few "invisible" operators:

  • U+2061 Function application (⁡)
  • U+2062 Invisible times (⁢)
  • U+2063 Invisible separator (⁣)
  • U+2064 Invisible plus (⁤)
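
Since these characters are, well, invisible, here is a small Python sketch that spells them out from the standard library's Unicode database. The names `INVISIBLE_OPERATORS` and `describe` are just illustrative.

```python
import unicodedata

# The four "invisible" operators listed above.
INVISIBLE_OPERATORS = [0x2061, 0x2062, 0x2063, 0x2064]

def describe(cp):
    """Return (code point label, Unicode name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))

rows = [describe(cp) for cp in INVISIBLE_OPERATORS]
for row in rows:
    # All four are reported in the "Cf" (format) general category.
    print(*row)
```

That they are format characters rather than symbols may be part of why fonts and renderers disagree on what to do with them.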

MathML Core doesn't mention anything special about these, apart from listing them in its appendices.

MathML4, however, has a few words to say in §3.1.1: "... they usually render invisibly ... but may influence visual spacing."
Note the ill-defined "usually" and "may" in that specification. After so many years, there's still an unaddressed blind spot.
Despite that, there are a few test cases in Joe Javawaski's Browser Test and the MathML3 Test Suite that use these operators, as well as other sources...

Consider $f(x)$, and $\cos \theta$. In MathML, this could be written as:

  <mi>f</mi>
  <mo>&ApplyFunction;</mo>
  <mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow>
  ...
  <mi>cos</mi>
  <mo>&ApplyFunction;</mo>
  <mi>&theta;</mi>

In the first case, the function application operator should not be rendered, but in the second case, one expects it to be rendered as spacing, unless other provisions are made.

What do I mean by "other provisions"? Let's first check the invisible times operator, for $a b$ and $\cos \theta \cos \phi$:

    <mi>a</mi>
    <mo>&InvisibleTimes;</mo>
    <mi>b</mi>
    ...
    <mi>cos</mi>
    <mo>&ApplyFunction;</mo>
    <mi>&theta;</mi>
    <mo>&InvisibleTimes;</mo>
    <mi>cos</mi>
    <mo>&ApplyFunction;</mo>
    <mi>&phi;</mi>

In the first case, the invisible times operator should not be rendered (implicit multiplication), but in the second case, it should be rendered as spacing between the two cosine functions.

Do you start seeing the problem?

  • Trying various MathML renderers (native or not), I can't make sense of how they handle these operators, or their absence. All I can say at a glance is that interpretations are inconsistent...
  • The specification is so vague that people often tweak the MathML lspace and rspace attributes to get the desired effect...
  • The MathML specification describes an algorithm for spacing, but it's not clear how these operators fit in, and more generally the algorithm does not seem to explain what we see in the wild.
  • SILE, anyhow, doesn't implement the MathML spacing algorithm, and relies on TeX's spacing rules (based on atom types).
  • To make matters worse, math fonts may or may not have glyphs for these operators, and even if they do, the glyphs may not be designed to be invisible (e.g. they may have some width): I haven't found a consistent specification for this matter either.

Earlier I mentioned "other provisions". It would seem (without looking at the code), for instance, that MathJax does some magic to distinguish between one-letter and multi-letter <mi> identifiers, adding spacing in the latter case. That's my guess based on observation, at best. It's as if multi-letter <mi> identifiers were treated as "operators" with some spacing.
It's not totally insane, if that's what it does: it's how I implemented the cos, sin, etc. functions in SILE in #2167 for the TeX-like syntax (as mo with the operator atom type -- note this is also what TeX does with \mathop in its implementation of these functions).
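
To make that guess concrete, here is a minimal Python sketch of the heuristic. It is only my reading of the observed behavior, not MathJax's actual code nor SILE's; the function name `guess_atom_type` and the "op"/"ord" labels are hypothetical stand-ins for TeX's operator and ordinary atom types.

```python
def guess_atom_type(mi_text: str) -> str:
    """Guess a TeX-like atom type for the content of an <mi> element.

    Assumption (not taken from any renderer's source): identifiers made of
    more than one character are function names such as "cos" and behave
    like operators; single-character identifiers are ordinary atoms.
    """
    text = mi_text.strip()
    # len() counts code points, so "θ" counts as a single letter.
    return "op" if len(text) > 1 else "ord"

for name in ("f", "cos", "θ"):
    print(name, guess_atom_type(name))
```

With such a rule, the spacing in "cos θ" would come from the atom types alone, and the function application operator between them would not need to contribute anything.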

All of this is nice and dandy, but it doesn't tell us what to do for MathML documents.

  • Invisible separator (a.k.a. invisible comma) can likely be completely ignored.
  • Invisible plus is a mystery to me, I've no idea what it's supposed to do and where...
  • Invisible times and invisible function application are the most problematic:
    • Sometimes they should be ignored?
    • Sometimes they should be rendered as spacing?
    • Or alter the previous element?
    • Explicit lspace and rspace on them may need to be honored, even though SILE currently ignores them on other operators (using, as stated above, their atom type to determine spacing).
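
For discussion's sake, the options above can be sketched as a small decision function in Python. This is one possible policy under loud assumptions, not what SILE or any other renderer does: the `Node` class, the returned labels, and the "multi-letter mi is a function name" test are all hypothetical, and invisible plus is ignored simply because I have no better idea.

```python
from dataclasses import dataclass
from typing import Optional

FUNCTION_APPLICATION = "\u2061"
INVISIBLE_TIMES = "\u2062"
INVISIBLE_SEPARATOR = "\u2063"
INVISIBLE_PLUS = "\u2064"

@dataclass
class Node:
    """Hypothetical, simplified MathML element."""
    tag: str                      # "mi", "mo", "mn", "mrow", ...
    text: str = ""
    lspace: Optional[str] = None  # explicit attributes, if present
    rspace: Optional[str] = None

def is_function_name(node: Optional[Node]) -> bool:
    # Assumption: a multi-letter <mi> (cos, sin, ...) is a function name.
    return node is not None and node.tag == "mi" and len(node.text) > 1

def invisible_operator_action(prev: Optional[Node], op: Node,
                              nxt: Optional[Node]) -> str:
    """Return "ignore", "thin-space" or "explicit-space" for an invisible <mo>."""
    if op.lspace is not None or op.rspace is not None:
        return "explicit-space"   # honor what the document asks for
    if op.text == FUNCTION_APPLICATION:
        # cos θ wants spacing; f(x) and cos(x) (fenced argument) do not.
        if is_function_name(prev) and nxt is not None and nxt.tag != "mrow":
            return "thin-space"
        return "ignore"
    if op.text == INVISIBLE_TIMES:
        # a b: nothing; cos θ cos φ: spacing next to a function name.
        if is_function_name(prev) or is_function_name(nxt):
            return "thin-space"
        return "ignore"
    # Invisible separator, invisible plus, anything else: no rendering.
    return "ignore"

a, b = Node("mi", "a"), Node("mi", "b")
cos, theta = Node("mi", "cos"), Node("mi", "θ")
print(invisible_operator_action(a, Node("mo", INVISIBLE_TIMES), b))
print(invisible_operator_action(cos, Node("mo", FUNCTION_APPLICATION), theta))
print(invisible_operator_action(theta, Node("mo", INVISIBLE_TIMES), cos))
```

Note that if multi-letter identifiers are already typeset as operator atoms, the "thin-space" cases would duplicate spacing that the atom types provide anyway, so the two approaches should not be combined blindly.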

My head hurts 🐰

And I've just scratched the surface.
One could also mention the indirect use of U+200B, the zero-width space, in this extract of the MathML Test Suite ("Torture Tests", complex1, simplified here for the sake of brevity and your own sanity):

  <mo>&nabla;</mo>
  <mtext>&#x200B;</mtext>
  <mo>&times;</mo>
  <mi>B</mi>

The intent of this dubious use of U+200B in an mtext element might have been to cancel some of the spacing that the "nabla" introduces as an operator, instead of making it an identifier (?)

Those folks are really pushing the envelope, bordering on the absurd, but it's all in the name of testing a totally insane and complex standard, so it's all good, right?