Sigil-Ebook/Sigil

[Bug]: "Reformat HTML" doesn't process MathML tags correctly

firstrose opened this issue · 11 comments

Bug Description

MathML tags are not formatted with indentation.

Platform (OS)

Windows (Default)

OS Version / Specifics

10.0.19045.3324

What version of Sigil are you using?

2.0.1

Any backtraces or crash reports

No response

Reformat HTML never changes whitespace in a file unless absolutely required.

Are you talking about Prettify?

The choices under Reformat HTML are 1) Mend and 2) Mend and Prettify.

I'm fairly certain we ignore the contents of math tags whenever Prettifying. Probably Mending, too, to be honest.

Hmm ... gumbo recognizes MathML tags, and Prettify uses gumbo, so in theory it should handle MathML. But it might not prettify it the way it does other code.

I will run a few tests to see.

Okay, I tested this, and Prettify leaves MathML untouched.

The problem is that a MathML tag could be either an inline or a block tag, so you cannot know whether it is okay to insert whitespace.

It would really help if you could provide as complete an example as possible of a "before" and "after" for "properly indented" MathML, so to speak.

I will look up which MathML tags, if any, are block level only and so are safe to add whitespace to, and which are inline only and should not have any whitespace changed.

But a visual example and test case would certainly help.

OK

That is the problem. I hadn't thought about this.
MathML uses the "display" attribute to determine whether it is inline:
<math display="block">
I don't know whether Prettify/gumbo can support this.

Here is a testcase.

mathtest.zip

There are two snippets. The first is inline, and the other is not.

Thank you.

For the record I found the following rules about whitespace for mathml in the spec.

In MathML, as in XML, "whitespace" means simple spaces, tabs, newlines, or 
carriage returns, i.e., characters with hexadecimal Unicode codes 
U+0020, U+0009, U+000A, or U+000D, respectively; see also the 
discussion of whitespace in Section 2.3 of [XML].

MathML ignores whitespace occurring outside token elements. 
Non-whitespace characters are not allowed there. Whitespace occurring 
within the content of token elements, except for <cs>, is normalized as 
follows. All whitespace at the beginning and end of the content is removed, 
and whitespace internal to content of the element is collapsed canonically, 
i.e., each sequence of 1 or more whitespace characters is replaced with one 
space character (U+0020, sometimes called a blank character).
...
For example, <mo> ( </mo> is equivalent to <mo>(</mo>, and

<mtext>
  Theorem
  1:
</mtext>
is equivalent to <mtext>Theorem 1:</mtext> or <mtext>Theorem&#x20;1:</mtext>.

Authors wishing to encode white space characters at the start or end of 
the content of a token, or in sequences other than a single space, without 
having them ignored, must use &nbsp; (U+00A0) or other non-marking 
characters that are not trimmed. For example, compare the above use 
of an mtext element with:

<mtext>
&#xA0;<!--NO-BREAK SPACE-->Theorem &#xA0;<!--NO-BREAK SPACE-->1: 
</mtext> 

When the first example is rendered, there is nothing before "Theorem", 
one Unicode space character between "Theorem" and "1:", and nothing 
after "1:". In the second example, a single space character is to be 
rendered before "Theorem"; two spaces, one a Unicode space character 
and one a Unicode no-break space character, are to be rendered 
before "1:"; and there is nothing after the "1:".

Note that the value of the xml:space attribute is not relevant in this 
situation since XML processors pass whitespace in tokens to a MathML 
processor; it is the requirements of MathML processing which specify that 
whitespace is trimmed and collapsed.

For whitespace occurring outside the content of the token elements 
mi, mn, mo, ms, mtext, ci, cn, cs, csymbol and annotation, an mspace element 
should be used, as opposed to an mtext element containing only whitespace 
entities.
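
For illustration, the trimming and collapsing rule quoted above can be sketched in a few lines of Python. This is only a reading of the spec text, not Sigil code; the function name `normalize_token_text` is made up for the example, and the `<cs>` exception is left to the caller.

```python
import re

# The four characters MathML (like XML) counts as whitespace:
# U+0020, U+0009, U+000A, U+000D. U+00A0 (no-break space) is
# deliberately not in this set, so it is never trimmed or collapsed.
_XML_WS = " \t\n\r"
_WS_RUN = re.compile(r"[ \t\n\r]+")


def normalize_token_text(text: str) -> str:
    """Trim leading/trailing XML whitespace and collapse inner runs to one space.

    Hypothetical helper for token elements such as <mi>, <mo>, <mtext>.
    It should not be applied to <cs>, which the spec exempts.
    """
    return _WS_RUN.sub(" ", text.strip(_XML_WS))


# The three examples from the spec text:
print(repr(normalize_token_text(" ( ")))
print(repr(normalize_token_text("\n  Theorem\n  1:\n")))
print(repr(normalize_token_text("\n\u00a0Theorem \u00a01: \n")))
```

In the third call the no-break spaces survive, so the result keeps one no-break space before "Theorem" and a space plus a no-break space before "1:", as the spec describes.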

Given the above: all non-token mathml tags should be treated just like "block level" tags in that a return can be added immediately after and then indented.

Do you agree with that interpretation?
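
To make that interpretation concrete, here is a rough Python sketch using the standard library's `xml.etree.ElementTree`. It is not how Sigil's Prettify works internally; `indent_mathml`, `TOKEN_TAGS` and the two-space step are assumptions for the example. It adds newlines and indentation only around the children of non-token elements and never touches the content of the token elements listed in the spec.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Token elements as listed in the quoted spec text; their content is left alone.
TOKEN_TAGS = {"mi", "mn", "mo", "ms", "mtext", "ci", "cn", "cs", "csymbol", "annotation"}


def local_name(elem):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return elem.tag.rsplit("}", 1)[-1]


def _is_blank(s):
    return s is None or s.strip(" \t\n\r") == ""


def indent_mathml(elem, level=0, step="  "):
    """Indent the children of non-token elements; skip tokens and childless elements."""
    if local_name(elem) in TOKEN_TAGS or len(elem) == 0:
        return
    # Assumption: in valid MathML only whitespace appears outside tokens.
    # If real text shows up here anyway, leave this element untouched.
    if not _is_blank(elem.text) or not all(_is_blank(c.tail) for c in elem):
        return
    elem.text = "\n" + step * (level + 1)
    for child in elem:
        indent_mathml(child, level + 1, step)
        child.tail = "\n" + step * (level + 1)
    elem[-1].tail = "\n" + step * level  # closing tag lines up with the opening tag


src = ('<math xmlns="http://www.w3.org/1998/Math/MathML">'
       '<mrow><mi>a</mi><mo>=</mo><mi>b</mi></mrow></math>')
root = ET.fromstring(src)
indent_mathml(root)
ET.register_namespace("", MATHML_NS)  # serialize with a default namespace
print(ET.tostring(root, encoding="unicode"))
```

The root element's own tail is never modified, so text that follows `</math>` inside a paragraph stays as it was. Putting every token on its own line is just one possible layout; keeping sibling tokens on a single line would be equally valid under the same rule.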

Fundamentally I agree with you.

But there is one point left to consider.

The spec allows tags to be indented, but it does not forbid stripping the whitespace and keeping the tags "inline".

MathML can be rendered as a block or inline. For a run of tags that are rendered inline, should they be formatted as "inline" too?

For example.

yeq6vnKl8g

this unformatted code rendered as

Klf40NpW6Y

Of course it should be indented after formatting.

While

GYtRhBu0Du

gives

PZ9eVbQ47E

Should the tags be formatted as "inline", like this?

<span><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>=</mo><mi>b</mi></math> is an equ</span>

Of course, you decide.

The real question, both with and without display="block", is how it appears when pretty printing follows the spec's whitespace rules. As long as the result on screen stays the same in both cases, we should be good to go.

I will run a few tests.

Using this test code:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
  <title></title>
</head>

<body>

<!-- display is block --> 
<p>look <math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mi>a</mi><mo>=</mo><mi>b</mi></math> is an equ.</p>
 
<p>&#160;</p>

<!-- display is block and prettified by spec whitespace rules --> 
<p>look 
    <math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
        <mi>a</mi><mo>=</mo><mi>b</mi>
    </math>
 is an equ.</p>

<p>----&#160;----</p>

<!-- no display block -->
<p>look <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>=</mo><mi>b</mi></math> is an equ.</p>

<p>&#160;</p>

<!-- no display block but prettified by spec whitespace rules -->
<p>look 
    <math xmlns="http://www.w3.org/1998/Math/MathML">
        <mi>a</mi><mo>=</mo><mi>b</mi>
    </math>
 is an equ.</p>
 
</body>
</html>
[Image: preview_results]

So pretty printing the MathML code according to the spec has no impact on the rendered result. Only display="block" does, and whether it is present does not affect how the code should be prettified (as long as you follow the MathML whitespace rules).

Do you agree?
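
The same conclusion can be checked without a renderer. The Python sketch below is a simplification: `canonical` is a hypothetical helper, and it assumes valid MathML with no text outside token elements. It reduces a `<math>` element to a form that ignores whitespace outside tokens and normalizes it inside them, then compares the compact and prettified snippets from the test file.

```python
import re
import xml.etree.ElementTree as ET

# Token elements as listed in the MathML whitespace rules.
TOKEN_TAGS = {"mi", "mn", "mo", "ms", "mtext", "ci", "cn", "cs", "csymbol", "annotation"}
_WS_RUN = re.compile(r"[ \t\n\r]+")


def canonical(elem):
    """Return a nested tuple that is insensitive to insignificant whitespace."""
    name = elem.tag.rsplit("}", 1)[-1]
    attrs = tuple(sorted(elem.attrib.items()))
    if name in TOKEN_TAGS:
        # Simplification: child elements inside a token are flattened to text.
        text = "".join(elem.itertext())
        if name != "cs":  # the spec exempts <cs> from normalization
            text = _WS_RUN.sub(" ", text.strip(" \t\n\r"))
        return (elem.tag, attrs, text)
    # Outside tokens, text and tails are ignored (assumed to be whitespace only).
    return (elem.tag, attrs, tuple(canonical(child) for child in elem))


compact = ('<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">'
           '<mi>a</mi><mo>=</mo><mi>b</mi></math>')
pretty = '''<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
        <mi>a</mi><mo>=</mo><mi>b</mi>
    </math>'''

print(canonical(ET.fromstring(compact)) == canonical(ET.fromstring(pretty)))  # True
```

Because attributes are part of the canonical form, a `<math>` with display="block" still compares as different from one without it, which matches what the preview shows.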

Based on the above, I just added support for prettifying MathML code in Sigil.
Closing this as fixed in master.