UnicodeMath implementation reference
znjameswu opened this issue · 3 comments
Here is a dedicated issue for the implementation of UnicodeMath spec.
After some quick experiment, I find some quite burning discrepencies in the UnicodeMath. The following discrepencies will directly affect parser design and behavior. I will document them within this issue and the docs.
Current Plan
There are two areas of UnicodeMath input support currently under consideration: user input, or a string input. The plan is to support both by treating string input as a stream of user character inputs (with the cursor fixed on end). (This behavior is not guaranteed by the spec itself, but it seems good.)
The UnicodeMath spec itself proposed a reference parsing algorithm in Appendix A. However there are some conflicts in the UnicodeMath spec.
Conflicts and Decision
Spec-wise, this reference algorithm deviates from the spec:
- Handling of the unary operators (Align with reference)
- '+^2', 'a^-2' as subsup? (Spec says no. Reference says yes.)
Also, MS Word's implementation deviates from the reference:
- Handling of the unary operators (remains to be decided)
- '1/+2', '+1/2' as fraction? (Reference disallows both. MS Word permits the first one.)
- Differentiation between the unary plus and unary minus (remains to be decided)
- 'a^-1' vs 'a^+1' as subsup? (Reference only allows the first. MS Word allows both.)
- Differentiation between / and other operators in handling unary operators (remains to be decided)
- '1/+2' as fraction, 'a\of+b' as function, and '\sqrt+2' as sqrt? (Reference says no no no. MS Word permits the first.)
Reference algorithm proposed by spec
- char ← Unicode character
- space ← ASCII space (U+0020)
- αASCII ← ASCII A-Z a-z
- nASCII ← ASCII 0-9
- αnMath ← Unicode math alphanumeric (U+1D400 – U+1D7FF with some Letterlike symbols U+2102 – U+2134)
- αnOther ← Unicode alphanumeric not including αnMath nor nASCII
- αn ← αnMath | αnOther
- diacritic ← Unicode combining mark
- opArray ← ‘&’ | VT | ‘■’
- opClose ← ‘)’ | ‘]’ | ‘}’ | ‘⌍’
- opCloser ← opClose | “\close”
- opDecimal ← ‘.’ | ‘,’
- opHbracket ← Unicode math horizontal bracket
- opNary ← Unicode integrals, summation, product, and other nary ops
- opOpen ← ‘(’ | ‘[’ | ‘{’ | ‘⌌’
- opOpener ← opOpen | “\open”
- opOver ← ‘/’ | “\atop”
- opBuildup ← ‘_’ | ‘^’ | ‘√’ | ‘∙’ | ‘√’ | ‘□’ | ‘/’ | ‘|’ | opArray | opOpen | opClose | opNary | opOver | opHbracket | opDecimal
- other ← char – {αn + nASCII + diacritic + opBuildup + CR}
- diacriticbase ← αn | nASCII | ‘(’ exp ‘)’
- diacritics ← diacritic | diacritics diacritic
- atom ← αn | diacriticbase diacritics
- atoms ← atom | atoms atom
- digits ← nASCII | digits nASCII
- number ← digits | digits opDecimal digits
- expBracket ← opOpener exp opCloser
← ‘||’ exp ‘||’
← ‘|’ exp ‘|’ - word ← αASCII | word αASCII
- scriptbase ← word | word nASCII | αnMath | number | other | expBracket | opNary
- soperand ← operand | ‘∞’ | ‘-’ operand | “-∞”
- expSubsup ← scriptbase ‘’ soperand ‘^’ soperand | scriptbase ‘^’ soperand ‘’ soperand
- expSubscript ← scriptbase ‘_’ soperand
- expSuperscript ← scriptbase ‘^’ soperand
- expScript ← expSubsup | expSubscript | expSuperscript
- entity ← atoms | expBracket | number
- factor ← entity | entity ‘!’ | entity “!!” | function | expScript
- operand ← factor | operand factor
- box ← ‘□’ operand
- hbrack ← opHbracket operand
- sqrt ← ‘√’ operand
- cubert ← ‘∙’ operand
- fourthrt ← ‘√’ operand
- nthrt ← “√(” operand ‘&’ operand ‘)’
- function ← sqrt | cubert | fourthrt | nthrt | box | hbrack
- numerator ← operand | fraction
- fraction ← numerator opOver operand
- row ← exp | row ‘&’ exp
- rows ← row | rows ‘@’ row
- array ← “\array(” rows ‘)’
- element ← fraction | operand | array
- exp ← element | exp other element
Current intended solution is to specialize '/' and '_', '^', so that they automatically raise the precedence of the unary-able operators behind them. I still need more testing on MS Word
Found a really good project to be a reference https://github.com/doersino/UnicodeMathML
After going through its implementation. I think the following solution might work for this project:
- Modify
petitparser
so that it can takeList<GreenNode>
as input. - Non-
SymbolNode
orSymbolNode
marked asescaped
will be viewed as primitives. Non-escapedSymbolNode
will be recognized as operators by the parser. - In conversion mode which takes a
String
as input, we translate theString
intoList<SymbolNode>
, and run the conversion-mode parser in forward order. This parser variant will be exhaustive and will try to parse everything till the end. (This will highly resemble UnicodeMathML project mentioned above) - In interactive mode which takes user input, we only supply a sublist of children of the current active
EquationRowNode
as the input to the parser. This parser variant will try to parse all input into one term, given operator precedence requirement.- We starting by giving sublist in the cursorPos-2:cursorPos range.
- If parser fails, decrease the lower bound by 1 and retry.
- If parser succeeds, use parsing results to modify the children list. Decrease the lower bound by 1 and try again.
- The loop terminates when the lower bound crosses zero.
- (This is not a backtracking process in regular expression! Everytime the parser succeeds, the input gets modified!)
The rationale is that:
- It is quite clear that we can't straight up use forward parsing for interactive mode (we should only built one term on the right end). Also, forward parsing sometimes would just give counter-intuitive (and non-MS-Office-compliant) results (e.g.
\int_i^j_i^j
evaluates to(\int_i^j)_i^j
in forward parsing) - The conversion mode needs to work in forward order. There are certain cases the interactive-mode parsing would just produce spec-breaking results (e.g.
\int_i^j_i^j a
evaluates to\int_i^(j_i^j) a
in backward parsing)
Why not let the parser parse backward for the interactive mode?
- Advantage: No backtracking. Start from the cursor position and matches leftward. No loop involved.
- Disadvantage: It seems MS Office disagrees with backward parsing (See separator ambiguity test results below)
Some other ambiguities related to separators (Tested in MS Word):
(1|2|3)
evaluates to a bracket containing1
,|2|
,3
(Documented in the spec)(1||)
evaluates to(
,1
,|
,|\box )
(1|2|)
evaluates to a bracket containing1
,|2|
(1|2|3|4)
evaluates to the form of(A|B)
whereA=1|2|3
andB=4
(This comes as a surprise. Backward parsing shouldn't produce this)