UnicodeMath implementation reference

Question

UnicodeMath implementation reference

znjameswu opened this issue 4 years ago · 3 comments

Here is a dedicated issue for the implementation of UnicodeMath spec.

After some quick experiment, I find some quite burning discrepencies in the UnicodeMath. The following discrepencies will directly affect parser design and behavior. I will document them within this issue and the docs.

Current Plan

There are two areas of UnicodeMath input support currently under consideration: user input, or a string input. The plan is to support both by treating string input as a stream of user character inputs (with the cursor fixed on end). (This behavior is not guaranteed by the spec itself, but it seems good.)

The UnicodeMath spec itself proposed a reference parsing algorithm in Appendix A. However there are some conflicts in the UnicodeMath spec.

Conflicts and Decision

Spec-wise, this reference algorithm deviates from the spec:

Handling of the unary operators (Align with reference)
- '+^2', 'a^-2' as subsup? (Spec says no. Reference says yes.)

Also, MS Word's implementation deviates from the reference:

Handling of the unary operators (remains to be decided)
- '1/+2', '+1/2' as fraction? (Reference disallows both. MS Word permits the first one.)
Differentiation between the unary plus and unary minus (remains to be decided)
- 'a^-1' vs 'a^+1' as subsup? (Reference only allows the first. MS Word allows both.)
Differentiation between / and other operators in handling unary operators (remains to be decided)
- '1/+2' as fraction, 'a\of+b' as function, and '\sqrt+2' as sqrt? (Reference says no no no. MS Word permits the first.)

Reference algorithm proposed by spec

char ← Unicode character
space ← ASCII space (U+0020)
αASCII ← ASCII A-Z a-z
nASCII ← ASCII 0-9
αnMath ← Unicode math alphanumeric (U+1D400 – U+1D7FF with some Letterlike symbols U+2102 – U+2134)
αnOther ← Unicode alphanumeric not including αnMath nor nASCII
αn ← αnMath | αnOther

diacritic ← Unicode combining mark
opArray ← ‘&’ | VT | ‘■’
opClose ← ‘)’ | ‘]’ | ‘}’ | ‘⌍’
opCloser ← opClose | “\close”
opDecimal ← ‘.’ | ‘,’
opHbracket ← Unicode math horizontal bracket
opNary ← Unicode integrals, summation, product, and other nary ops
opOpen ← ‘(’ | ‘[’ | ‘{’ | ‘⌌’
opOpener ← opOpen | “\open”
opOver ← ‘/’ | “\atop”
opBuildup ← ‘_’ | ‘^’ | ‘√’ | ‘∙’ | ‘√’ | ‘□’ | ‘/’ | ‘|’ | opArray | opOpen | opClose | opNary | opOver | opHbracket | opDecimal
other ← char – {αn + nASCII + diacritic + opBuildup + CR}

diacriticbase ← αn | nASCII | ‘(’ exp ‘)’
diacritics ← diacritic | diacritics diacritic
atom ← αn | diacriticbase diacritics
atoms ← atom | atoms atom
digits ← nASCII | digits nASCII
number ← digits | digits opDecimal digits

expBracket ← opOpener exp opCloser
← ‘||’ exp ‘||’
← ‘|’ exp ‘|’
word ← αASCII | word αASCII
scriptbase ← word | word nASCII | αnMath | number | other | expBracket | opNary
soperand ← operand | ‘∞’ | ‘-’ operand | “-∞”
expSubsup ← scriptbase ‘’ soperand ‘^’ soperand | scriptbase ‘^’ soperand ‘’ soperand
expSubscript ← scriptbase ‘_’ soperand
expSuperscript ← scriptbase ‘^’ soperand
expScript ← expSubsup | expSubscript | expSuperscript

entity ← atoms | expBracket | number
factor ← entity | entity ‘!’ | entity “!!” | function | expScript
operand ← factor | operand factor
box ← ‘□’ operand
hbrack ← opHbracket operand
sqrt ← ‘√’ operand
cubert ← ‘∙’ operand
fourthrt ← ‘√’ operand
nthrt ← “√(” operand ‘&’ operand ‘)’
function ← sqrt | cubert | fourthrt | nthrt | box | hbrack
numerator ← operand | fraction
fraction ← numerator opOver operand

row ← exp | row ‘&’ exp
rows ← row | rows ‘@’ row
array ← “\array(” rows ‘)’

element ← fraction | operand | array
exp ← element | exp other element

Answer 1 · 2020-07-28T04:27:35.000Z

Current intended solution is to specialize '/' and '_', '^', so that they automatically raise the precedence of the unary-able operators behind them. I still need more testing on MS Word

Answer 2 · 2020-12-30T01:29:05.000Z

Found a really good project to be a reference https://github.com/doersino/UnicodeMathML

After going through its implementation. I think the following solution might work for this project:

Modify petitparser so that it can take List<GreenNode> as input.
Non-SymbolNode or SymbolNode marked as escaped will be viewed as primitives. Non-escaped SymbolNode will be recognized as operators by the parser.
In conversion mode which takes a String as input, we translate the String into List<SymbolNode>, and run the conversion-mode parser in forward order. This parser variant will be exhaustive and will try to parse everything till the end. (This will highly resemble UnicodeMathML project mentioned above)
In interactive mode which takes user input, we only supply a sublist of children of the current active EquationRowNode as the input to the parser. This parser variant will try to parse all input into one term, given operator precedence requirement.
- We starting by giving sublist in the cursorPos-2:cursorPos range.
- If parser fails, decrease the lower bound by 1 and retry.
- If parser succeeds, use parsing results to modify the children list. Decrease the lower bound by 1 and try again.
- The loop terminates when the lower bound crosses zero.
- (This is not a backtracking process in regular expression! Everytime the parser succeeds, the input gets modified!)

The rationale is that:

It is quite clear that we can't straight up use forward parsing for interactive mode (we should only built one term on the right end). Also, forward parsing sometimes would just give counter-intuitive (and non-MS-Office-compliant) results (e.g. \int_i^j_i^j evaluates to (\int_i^j)_i^j in forward parsing)
The conversion mode needs to work in forward order. There are certain cases the interactive-mode parsing would just produce spec-breaking results (e.g. \int_i^j_i^j a evaluates to \int_i^(j_i^j) a in backward parsing)

Why not let the parser parse backward for the interactive mode?

Advantage: No backtracking. Start from the cursor position and matches leftward. No loop involved.
Disadvantage: It seems MS Office disagrees with backward parsing (See separator ambiguity test results below)

Answer 3 · 2020-12-30T01:43:50.000Z

Some other ambiguities related to separators (Tested in MS Word):

(1|2|3) evaluates to a bracket containing 1, |2|, 3 (Documented in the spec)
(1||) evaluates to (, 1, |, |\box )
(1|2|) evaluates to a bracket containing 1, |2|
(1|2|3|4) evaluates to the form of (A|B) where A=1|2|3 and B=4 (This comes as a surprise. Backward parsing shouldn't produce this)