[Bug]: "Reformat HTML" doesn't process MathML tags correctly
firstrose opened this issue · 11 comments
Reformat html does not ever change whitespace in a file unless absolutely required.
Are you talking about Prettify?
The choices under Reformat HTML are 1) Mend and 2) Mend and Prettify.
I'm fairly certain we ignore the contents of math tags whenever Prettifying. Probably Mending, too, to be honest.
Hmm .... gumbo supports and recognizes mathml tags, and Prettify uses gumbo so it should handle mathml theoretically. But it might not pretty it up the way it does other code.
I will run a few tests to see.
Okay, tested this and Prettify changes no mathml.
The problem is a mathml tag could be either inline or a block tag so you can not know if it is okay to insert whitespace or not.
It would really help if you could provide as complete an example as possible of a "before" and "after" for "properly indented" mathml, so to speak.
I will look up which if any mathml tags are block level only and so are safe to add whitespace to and which are inline only and should not have any whitespace changed.
But a visual example and test case would certainly help.
OK
That is the problem. I hadn't thought about this.
MathML uses the "display" attribute to determine whether it is rendered inline.
<math display="block">
I don't know whether Prettify/gumbo can support this.
Here is a testcase.
There are two snippets. The first is inline, and the other is not.
Thank you.
For the record I found the following rules about whitespace for mathml in the spec.
In MathML, as in XML, "whitespace" means simple spaces, tabs, newlines, or
carriage returns, i.e., characters with hexadecimal Unicode codes
U+0020, U+0009, U+000A, or U+000D, respectively; see also the
discussion of whitespace in Section 2.3 of [XML].
MathML ignores whitespace occurring outside token elements.
Non-whitespace characters are not allowed there. Whitespace occurring
within the content of token elements, except for <cs>, is normalized as
follows. All whitespace at the beginning and end of the content is removed,
and whitespace internal to content of the element is collapsed canonically,
i.e., each sequence of 1 or more whitespace characters is replaced with one
space character (U+0020, sometimes called a blank character).
...
For example, <mo> ( </mo> is equivalent to <mo>(</mo>, and
<mtext>
Theorem
1:
</mtext>
is equivalent to <mtext>Theorem 1:</mtext> or <mtext>Theorem&#x20;1:</mtext>.
Authors wishing to encode white space characters at the start or end of
the content of a token, or in sequences other than a single space, without
having them ignored, must use &#xA0; (U+00A0) or other non-marking
characters that are not trimmed. For example, compare the above use
of an mtext element with:
<mtext>
&#xA0;<!--NO-BREAK SPACE-->Theorem &#xA0;<!--NO-BREAK SPACE-->1:
</mtext>
When the first example is rendered, there is nothing before "Theorem",
one Unicode space character between "Theorem" and "1:", and nothing
after "1:". In the second example, a single space character is to be
rendered before "Theorem"; two spaces, one a Unicode space character
and one a Unicode no-break space character, are to be rendered
before "1:"; and there is nothing after the "1:".
Note that the value of the xml:space attribute is not relevant in this
situation since XML processors pass whitespace in tokens to a MathML
processor; it is the requirements of MathML processing which specify that
whitespace is trimmed and collapsed.
For whitespace occurring outside the content of the token elements
mi, mn, mo, ms, mtext, ci, cn, cs, csymbol and annotation, an mspace element
should be used, as opposed to an mtext element containing only whitespace
entities.
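The trimming and collapsing rule quoted above can be sketched in a few lines of Python. This is only an illustration of the spec text, not code from Sigil or gumbo; the names `MATHML_WS` and `normalize_token_text` are made up for this example. Note that it deliberately avoids a bare `str.strip()`, which would also remove U+00A0 and so contradict the no-break space example.

```python
import re

# The four characters the quoted spec text counts as whitespace:
# U+0020, U+0009, U+000A, U+000D. U+00A0 is intentionally not included.
MATHML_WS = " \t\n\r"
_WS_RUN = re.compile(r"[ \t\n\r]+")


def normalize_token_text(text: str) -> str:
    """Trim leading/trailing MathML whitespace and collapse inner runs to one space."""
    trimmed = text.strip(MATHML_WS)
    return _WS_RUN.sub(" ", trimmed)


# The examples from the quoted spec text.
print(repr(normalize_token_text(" ( ")))                # '('
print(repr(normalize_token_text("\n Theorem\n 1:\n")))  # 'Theorem 1:'
# No-break spaces survive both the trim and the collapse.
print(repr(normalize_token_text("\n\u00a0Theorem \u00a01:\n")))
```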
Given the above, all non-token mathml tags can be treated just like "block level" tags: a line break can be added immediately after each one, and the following tag can then be indented.
Do you agree with that interpretation?
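To make that interpretation concrete, here is a minimal Python sketch that indents a MathML fragment by adding whitespace only around the children of non-token elements. It is not how Sigil's gumbo-based Prettify is implemented; `indent_mathml`, `local_name` and `TOKEN_TAGS` are hypothetical names, and the sketch assumes token elements hold only text, so their content is never touched.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Token elements as listed in the quoted spec text; whitespace inside them is left alone.
TOKEN_TAGS = {"mi", "mn", "mo", "ms", "mtext", "ci", "cn", "cs", "csymbol", "annotation"}

# Serialize MathML with a default xmlns instead of an ns0: prefix.
ET.register_namespace("", MATHML_NS)


def local_name(elem):
    """Return the tag name without its '{namespace}' prefix."""
    return elem.tag.rsplit("}", 1)[-1]


def indent_mathml(elem, level=0, step="  "):
    """Insert newline + indent between the children of non-token elements only."""
    if local_name(elem) in TOKEN_TAGS or len(elem) == 0:
        return  # token or empty element: leave its content exactly as written
    elem.text = "\n" + step * (level + 1)
    last_index = len(elem) - 1
    for i, child in enumerate(elem):
        indent_mathml(child, level + 1, step)
        # The last child is followed by the parent's closing tag, one level up.
        child.tail = "\n" + step * (level if i == last_index else level + 1)


source = '<math xmlns="%s"><mi>a</mi><mo>=</mo><mi>b</mi></math>' % MATHML_NS
root = ET.fromstring(source)
indent_mathml(root)
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

The top-level call never touches the tail of the `math` element itself, so the text that follows the formula in the surrounding paragraph is left unchanged.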
Fundamentally I agree with you.
But there is one point left to consider.
The spec allows tags to be indented, but it does not forbid stripping the whitespace and keeping the tags "inline".
MathML can be rendered as a block or inline. For a run of tags that are rendered inline, should they also be formatted as "inline"?
For example.
this unformatted code rendered as
Of course it should be indented after formatting.
While
gives
Should the tags be formatted as "inline", like this?
<span><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>=</mo><mi>b</mi></math> is an equ</span>
Of course, you decide.
The real question, both with and without display="block", is how the result appears once pretty printing applies the spec's whitespace rules. As long as the result/look on screen stays the same in both cases, we should be good to go.
I will run a few tests.
Using this test code:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
<title></title>
</head>
<body>
<!-- display is block -->
<p>look <math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mi>a</mi><mo>=</mo><mi>b</mi></math> is an equ.</p>
<p> </p>
<!-- display is block and prettified by spec whitespace rules -->
<p>look
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
<mi>a</mi><mo>=</mo><mi>b</mi>
</math>
is an equ.</p>
<p>---- ----</p>
<!-- no display block -->
<p>look <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>=</mo><mi>b</mi></math> is an equ.</p>
<p> </p>
<!-- no display block but prettified by spec whitespace rules -->
<p>look
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>a</mi><mo>=</mo><mi>b</mi>
</math>
is an equ.</p>
</body>
</html>
So pretty printing the mathml code following the spec has no impact on the produced image. Only display="block" changes the rendering, and whether it is present or not does not affect how the code can be prettified (as long as you follow the mathml whitespace rules).
Do you agree?
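The same conclusion can be checked without looking at a rendered image. The sketch below is only an illustration under stated assumptions: `canonical` is a hypothetical helper, token elements are assumed to contain plain text only, and the two strings are the compact and prettified `display="block"` fragments from the test file above. It compares the two forms after discarding the whitespace that the quoted rules treat as insignificant.

```python
import re
import xml.etree.ElementTree as ET

TOKEN_TAGS = {"mi", "mn", "mo", "ms", "mtext", "ci", "cn", "cs", "csymbol", "annotation"}
_WS_RUN = re.compile(r"[ \t\n\r]+")


def canonical(elem):
    """Nested tuple form of a MathML tree that ignores insignificant whitespace."""
    name = elem.tag.rsplit("}", 1)[-1]
    if name in TOKEN_TAGS:
        # Inside tokens: trim, then collapse runs to a single space.
        text = _WS_RUN.sub(" ", (elem.text or "").strip(" \t\n\r"))
    else:
        # Outside tokens whitespace is ignored entirely.
        text = ""
    return (name, sorted(elem.attrib.items()), text, [canonical(c) for c in elem])


NS = "http://www.w3.org/1998/Math/MathML"
compact = '<math xmlns="%s" display="block"><mi>a</mi><mo>=</mo><mi>b</mi></math>' % NS
prettified = (
    '<math xmlns="%s" display="block">\n'
    '<mi>a</mi><mo>=</mo><mi>b</mi>\n'
    '</math>' % NS
)

same = canonical(ET.fromstring(compact)) == canonical(ET.fromstring(prettified))
print(same)
```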
Based on the above, I just added support for prettifying mathml code in Sigil.
Closing this as fixed in master.