whatwg/html-build

Fix MathML/HTML entity divergence

Closed this issue · 36 comments

html5lib/html5lib-tests#71 showed up the fact that MathML defines &tdot;, &TripleDot;, &DotDot;, and &DownBreve; differently to HTML.

This seems to be because we don't match the behaviour of https://github.com/w3c/xml-entities/blob/gh-pages/entities.xsl#L174 (the template starting with <xsl:template match="entity">; note this is XSLT 2 so isn't supported in that many places!) in .entity-processor.py and .entity-processor-json.py (why oh why do we have two different files with so much duplicated code?).
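As a rough illustration of the guarding behaviour described above, here is a minimal Python sketch. It is not the logic of entities.xsl or of .entity-processor.py; the function name `guard_combining` and the use of the Unicode general category to detect a combining character are assumptions made for illustration only.

```python
import unicodedata


def guard_combining(expansion: str) -> str:
    """Prefix an entity expansion with a space if it starts with a combining mark.

    Hypothetical helper: approximates "combining character" as any code point
    whose Unicode general category starts with "M" (Mn, Mc, Me).
    """
    if expansion and unicodedata.category(expansion[0]).startswith("M"):
        return " " + expansion
    return expansion


# U+20DB is the code point behind &tdot; and &TripleDot;
print(repr(guard_combining("\u20db")))
# A non-combining expansion is left alone
print(repr(guard_combining("\u2026")))
```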

/cc @fred-wang @davidcarlisle

Given https://bugs.webkit.org/show_bug.cgi?id=151562 and https://bugzilla.mozilla.org/show_bug.cgi?id=1223829 this would break what implementations do at the moment.

@hsivonen probably wants to be aware of this, maybe @zcorpan, not sure who else.

One thing to note is that when I worked on https://bugzilla.mozilla.org/show_bug.cgi?id=603716 five years ago, I just copied David Carlisle's http://www.w3.org/2003/entities/2007/htmlmathml-f.ent which does have a space for these entities, although for some reason that space is not preserved by Gecko at the end (why didn't I write a test?).

In httparchive:

SELECT page, COUNT(*) as num
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE REGEXP_MATCH(body, r'&(tdot|TripleDot|DotDot|DownBreve);')
GROUP BY page
ORDER BY num DESC

0 rows.
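The same pattern can be reused to check local files rather than httparchive. A minimal sketch, where `count_entity_uses` is a hypothetical helper and not part of any tool mentioned in this thread:

```python
import re

# Same pattern as the REGEXP_MATCH call in the query above.
ENTITY_RE = re.compile(r"&(tdot|TripleDot|DotDot|DownBreve);")


def count_entity_uses(body: str) -> int:
    """Return how many times one of the four entities appears in body."""
    return len(ENTITY_RE.findall(body))


print(count_entity_uses("<mo>&DownBreve;</mo> and a&DotDot;"))
```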

I tried searching on GitHub but didn't find much other than test cases, and they didn't really check whether the character could combine with something else. I found https://rawgit.com/operasoftware/presto-testo/79ffdbf678dacee20b93c5b0541dbe978ab14954/core/standards/entity-references/mathml.xhtml which expects a space and Presto passes. Blink has a weird XML parser error, and some failures. Gecko also fails some of these. I don't know if the test is correct...

Testing in http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3764 shows that browsers happily combine &DownBreve; even in XHTML. Including Presto. (Not sure how it passes the test but still combines here?)

I submitted a testcase for html5lib-tests as well as patches for Gecko and WebKit. If the change was not made on purpose and this is just a mistake in the HTML5 specification, I would suggest aligning with the XML Entity Definitions for Characters specification. These entities are probably not really important, so the changes in implementations should not be a concern. Inconsistency between the two specifications is (at least for me) a problem.

I'm more concerned about taking an interoperable state where everyone agrees and turning it into a halfway state lacking interop. That seems much worse than misaligning with an old XML spec.

Implementation interoperability is a concern in general, but in this very particular case it should not matter given the limited usage of these entities. So far nobody has provided a justification for why this change was made, so it does not seem to be intentional. Actually, per #42 (comment), the html-build script does download the unicode.xml file from this "old" XML spec but generates the wrong output. So that seems to be a bug/mistake that should be fixed.

Before implementations change it would probably be best to get agreement that the HTML spec will change as it is fairly unambiguous that there should be no space currently in text/html (even if that state was unintentional, that's clearly what it says).

The space was added with MathML in mind specifically because it would have no visual effect (although it does cause a space in the DOM), as the rendering of <mo> &zzz; </mo> trims white space. The argument that the space has no visible effect is rather weaker for general HTML use, where these things could conceivably be used in the middle of a text run. So personally I'm relaxed about HTML not having spaces here.

@domenic it's not an "old XML spec" it's a "current XML spec" and the source (although not attributed) of the entity definitions in the HTML spec. If HTML spec is going to change, then I don't need to do anything, but if it is not, what I suggest is that the XML DTD at

http://www.w3.org/2003/entities/2007/htmlmathml-f.ent

is modified to have a parameter entity for the space guarding combining characters, then
implementations wanting an HTML-compatible XML definition of these entities just load the file after setting the parameter entity to empty.

So on balance, if implementations are willing to change, and the HTML spec will change, I'd think having spaces is best, but if not, I'll adjust the entities spec to match reality...

FWIW, my personal preference would be to replace these with non-combining equivalents when/if they become available in Unicode. So I believe it's important to agree that the HTML5 spec will keep in sync with the unicode.xml file (as implied by the html-build script). I don't have a preference between the space / no-space options right now, although the latter seems more standard in MathML documents.

David's argument is that the space does not make a difference for MathML but this actually does not explain why this space is actually added. The explanation given in the spec is that we do not want to combine with the previous char and we lack a non-combining equivalent. However, I'm not sure this space is really helpful for MathML implementation.

[BTW, I do not like the space-trimming rule in MathML because it makes implementation harder (it must be handled for text rendering, the operator dictionary, operator stretching, etc.) and is not really useful for authors. I hope this will be removed in a future version of the spec.]

So are implementors willing to change this?

As said above, patches are ready for WebKit & Gecko but I'm not going to ask for a review until the spec authors agree.

OK, it seems to me that we can go either way with this, but HTML and XHTML should be consistent, and if we change the HTML spec we should be confident that everyone will change.

Can you ask the reviewers to chime in here?

I still don't understand why we would make changes here at all. We currently have perfect interop. Patches for WebKit and Gecko cover 2/4 engines, putting us in exactly the worst possible place.

Can you please put aside interoperability for one second and answer the initial question: was this change from the XML entity specification made on purpose or is it an error introduced by the script used to generate the HTML5 entity list? If it is a mistake, then it makes sense to fix it and the fact that it was propagated to all engines is not really a strong reason to keep the status quo. If it was done on purpose, then I would suggest that David updates his unicode.xml for consistency with HTML5.

Now again, these entities are not really used in practice and fixing them seems trivial for web engines, so I'm not sure why you focus so much on interoperability here. Do you have an example of popular pages that would be broken by this change? Do you think it would be hard to fix it in e.g. Blink?

I don't know the original intent. However, I disagree:

the fact that it was propagated to all engines is not really a strong reason to keep the status quo

This is an extremely strong reason to keep the status quo, perhaps the strongest possible reason.

On 4 January 2016 at 17:47, Frédéric Wang notifications@github.com wrote:

I still don't understand why we would make changes here at all. We
currently have perfect interop. Patches for WebKit and Gecko cover 2/4
engines, putting us in exactly the worst possible place.

Can you please put aside interoperability for one second and answer the
initial question: was this change from the XML entity specification made on
purpose or is it an error introduced by the script used to generate the
HTML5 entity list? If it is a mistake, then it makes sense to fix it and
the fact that it was propagated to all engines is not really a strong
reason to keep the status quo. If it was done on purpose, then I would
suggest that David updates his unicode.xml for consistency with HTML5.

Well that's rather the point and the probable cause of the discrepancy.
unicode.xml has no indication of these spaces guarding combining
characters; they are just added in the XSL processing that generates the
entity files.
The HTML version of the processing isn't written in XSLT (sadly:-)

Originally (MathML1-2 timeframe) there were more combining characters used
and it was useful that the script detected their use, but over time we
found more non-combining versions (or got Unicode to add them), eventually
being left with just the three or four more or less arcane cases.

Now again, these entities are not really used in practice and fixing them
seems trivial for web engines, so I'm not sure why you focus so much on
interoperability here. Do you have an example of popular pages that would
be broken by this change? Do you think it would be hard to fix it in e.g.
Blink?

I really don't have a strong opinion on whether the definitions should
change in HTML, any change has some costs.

I agree though that the current situation is somewhat underdocumented and
that if the implementations don't change, the HTML spec ought to call out
that the XML syntax version of the entities differs from the HTML version
in requiring the trailing ; and including a space in these cases. I'd add a
similar note to the entities spec.

If it does change so XML and HTML both have a space, I could change
unicode.xml to encode these explicitly as multi-character entities rather
than leave it to the processing to automatically quote combining characters.

David

if the implementations don't change the HTML spec ought to call out
that the xml syntax version of the entities differs from the HTML version
in requiring the trailing ; and including a space in these cases

If browsers don't change, then even in XML there is no space, as far as I can tell. Currently it's just the entities spec that doesn't match implementations.

On 4 January 2016 at 20:41, Simon Pieters notifications@github.com wrote:

if the implementations don't change the HTML spec ought to call out
that the xml syntax version of the entities differs from the HTML version
in requiring the trailing ; and including a space in these cases

If browsers don't change, then even in XML there is no space, as far as I
can tell. Currently it's just the entities spec that doesn't match
implementations.

well an xml parser (unlike the html one) will just do whatever it says in
the declarations it reads, so any xml application will match the entities
spec unless it has an edited version of the DTD declarations somewhere. The
XML entities spec is self implementing in that sense, any application that
uses the entities distributed with the spec will use definitions that match
the documentation.
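The declarative behaviour described above can be reproduced with any XML parser. A minimal Python sketch follows; as an assumption for illustration, an internal DTD subset stands in for the external htmlmathml-f.ent file and declares DotDot with the guarding space this thread describes.

```python
import xml.etree.ElementTree as ET

# Assumption: an internal DTD subset stands in for the external entity file,
# declaring DotDot with a guarding space as the thread describes.
DOC = (
    '<!DOCTYPE p [<!ENTITY DotDot " &#x20DC;">]>'
    "<p>a&DotDot;</p>"
)

# The parser expands &DotDot; using whatever the declaration says.
text = ET.fromstring(DOC).text
print([ord(ch) for ch in text])
```

Dropping the space from the declaration changes the parsed text accordingly, which is the sense in which the entity files are "self-implementing".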

I just saved this

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "foo.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 </head>
 <body>
   <h1>testing DotDot</h1>
    <p id="x">a&DotDot;</p>
    <script>
       var p=document.getElementById("x")
       alert(p.innerHTML.charCodeAt(1));
     </script>
  </body>
</html>



as zz.html and zz.xhtml and in firefox (nightly 46.0a1) the accent combined
in html but not in xhtml
and the alert showed 32 for the xhtml version and 8412 for the HTML
version.

So as far as I can see Firefox at least follows the standard XML
declarations for XHTML, with a space, and follows the HTML spec, with no
space, in text/html.
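The text/html half of this result can be cross-checked outside a browser. Python's standard `html` module documents its table as the HTML5 named character references, so it would be expected to behave like the text/html case; a small sketch (the variable name is arbitrary):

```python
import html

# html.unescape() resolves named references using Python's built-in
# HTML5 named character reference table.
expanded = html.unescape("a&DotDot;")

# Mirrors p.innerHTML.charCodeAt(1) from the test page above.
print(ord(expanded[1]))
```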

I attach a screenshot.

David

I wrote

in firefox (nightly 46.0a1) the accent combined in html but not in xhtml

and the alert showed 32 for the xhtml version and 8412 for the HTML
version.

So as far as I can see Firefox at least follows the standard XML
declarations for XHTML, with a space, and follows the HTML spec, with no
space, in text/html.

I now see Chrome and IE 11 get this wrong (as currently specified) and the
alert shows 8412 in both cases.

This whole thing started with a recommendation from the I18n group that
entities not start with a combining character.
If there is a plan to change this it would probably be a good idea to get
review from them.

David

OK I see that in Firefox with your test. I had tested this in Live DOM Viewer earlier, but I see now that I made a mistake. Corrected version is http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3804 where Firefox does use a space in XHTML. My bad.

Thanks David. The result you indicate for Gecko is not surprising to me, given that the entities are defined in two places: one table generated from the HTML5 data (https://dxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5NamedCharactersInclude.h) and the DTD taken from the XML entity spec (https://dxr.mozilla.org/mozilla-central/source/dom/xml/htmlmathml-f.ent).

The point is that the XML entity spec is actually the original source and the HTML5 data are derived from it (although maybe not directly from unicode.xml). So you will have incompatibilities each time one program uses the original source and another one uses the HTML5 data generated from the erroneous script.
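One way to see which HTML-derived entries are affected is to scan a table generated from the HTML5 data for expansions that begin with a combining mark. A minimal sketch using Python's `html.entities.html5` table as a stand-in for such data; the helper name and the category-based check are assumptions, not how any of the scripts discussed here work.

```python
import unicodedata
from html.entities import html5


def starts_with_combining(expansion: str) -> bool:
    """True if the first code point has general category M* (hypothetical check)."""
    return unicodedata.category(expansion[0]).startswith("M")


# Names whose expansion starts with a combining mark, i.e. with no guarding space.
unguarded = sorted(
    name for name, value in html5.items() if starts_with_combining(value)
)
print(unguarded)
```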

On 5 January 2016 at 09:58, Frédéric Wang notifications@github.com wrote:

Thanks David. The result you indicate for Gecko is not surprising to me,
given that the entities are defined in two places: one table generated from
the HTML5 data (
https://dxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5NamedCharactersInclude.h)
and the DTD taken from the XML for entity spec (
https://dxr.mozilla.org/mozilla-central/source/dom/xml/htmlmathml-f.ent).

The point is that the XML entity spec is actually the original source and
the HTML5 data are derived from it (although maybe not directly from
unicode.xml). So you will have incompatibilities each time one program uses
the original source and another one uses the HTML5 data generated from the
erroneous script.

Yes although one could argue that it is not an incompatibility just a
difference between xml and html parsing (sadly there are many of those, but
too late to worry too much about that now), possibly html should change
here but if it doesn't I think that's OK as well, I'm happy to let the
implementers decide that (and I'll just add a note to the entities spec
saying what the decision is)

The fact that some of the browsers are getting the wrong results in XML
parsing is more worrying.

So to summarise current implementations:

| Test | Blink, WebKit | Gecko, Presto | Edge |
| --- | --- | --- | --- |
| HTML | no space | no space | no space |
| XHTML | no space | space | &DownBreve; not supported? |

Edge in XHTML doesn't show a hat at all, just "a", apparently. I suppose Edge doesn't support the entity and its behavior is to expand unknown entities to the empty string, maybe?

On 5 January 2016 at 10:33, Simon Pieters notifications@github.com wrote:

So to summarise current implementations:

| Test (http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3804) | Blink, WebKit | Gecko, Presto | Edge |
| --- | --- | --- | --- |
| HTML | no space | no space | ? |
| XHTML | no space | space | ? |

Note that the HTML and XHTML cases are rather different: in the HTML case
the listed implementations are essentially all you need to consider, but,
as noted earlier, in the XHTML case the definitions are read declaratively
by any XML parser for each document; they are not built in.

As far as I know the only publicly advertised set of declarations that
define these
are

http://www.w3.org/2003/entities/2007/htmlmathml-f.ent

so unless systems have locally modified DTD, every XML editor, or XML
processing pipeline
given a document using these entities will add a space.

David

Actually, the HTML standard defines the DTD for XML/XHTML: https://html.spec.whatwg.org/multipage/xhtml.html#parsing-xhtml-documents. As far as I can tell that is a DTD generated from the same data so they might very well not have a space either.

On 5 January 2016 at 10:44, Anne van Kesteren notifications@github.com
wrote:

Actually, the HTML standard defines the DTD for XML/XHTML:
https://html.spec.whatwg.org/multipage/xhtml.html#parsing-xhtml-documents.
As far as I can tell that is a DTD generated from the same data so they
might very well not have a space either.

Sorry you are right (I did know that once:-) yes the inline data: URL
version. That doesn't have a space.

OK...

Personally I'd suggest leaving HTML with no space and amending the XHTML
version in the HTML spec to match
htmlmathml-f.ent
or it could actually reference that rather than just copy it inaccurately
without acknowledgement:-)

As I said earlier the spaces were added after review by the I18n group and
the comments in charmod (which admittedly never got to REC status) saying
that xml entities should not start with a combining character.

But whatever. If you can get some interoperable version between the major
browsers, go for that and I'll adjust the entities spec and
htmlmathml-f.ent to match.

It may be worth noting that it is in any case virtually impossible to use
any of these entities in a reasonable way in XHTML.

The only way to use XHTML and make the document be processed in standards
mode is to specify a DTD that does not define these entities.

This was addressed in a bug report and related change proposal here:

http://www.w3.org/TR/2013/WD-xhtml-pubid-20130822/

but it was stalled as Ian declined to change the HTML spec in advance of
the major implementations changing and
I decided there are only so many battles you can fight and let it drop.

So if you were planning on making changes here you might also want to adopt
the one line change suggested in the above.

Okay, so I think my preference would be that we keep the interoperability we have for HTML. Then:

  1. Attempt to get interoperability in XML that matches what browsers do for HTML. Even though this goes against the i18n WG's wishes it does not seem like a big enough deal to change HTML over, and having HTML and XML be different seems worse.
  2. Per http://www.w3.org/TR/2013/WD-xhtml-pubid-20130822/ stop using U+00A0 so copy-and-pasting identifiers becomes possible.
  3. Acknowledge where the named character references come from. It seems @davidcarlisle would prefer that. It seems the easiest would be at the end of section 12.5.
  4. Consider adding -//W3C//ENTITIES HTML MathML Set//EN//XML as identifier. If we were to do this I would prefer something easier, e.g. web-entities. Something you do not have to copy-and-paste.

On 8 January 2016 at 13:17, Anne van Kesteren notifications@github.com
wrote:

Okay, so I think my preference would be that we keep the interoperability
we have for HTML. Then:

  1. Attempt to get interoperability in XML that matches what browsers
    do for HTML. Even though this goes against the i18n WG's wishes it does not
    seem like a big enough deal to change HTML over, and having HTML and XML be
    different seems worse.
  2. Per http://www.w3.org/TR/2013/WD-xhtml-pubid-20130822/ stop using
    U+00A0 so copy-and-pasting identifiers becomes possible.
  3. Acknowledge where the named character references come from. It
    seems @davidcarlisle https://github.com/davidcarlisle would prefer
    that. It seems the easiest would be at the end of section 12.5.
  4. Consider adding -//W3C//ENTITIES HTML MathML Set//EN//XML as
    identifier. If we were to do this I would prefer something easier, e.g.
    web-entities. Something you do not have to copy-and-paste.

If you can get browser implementer agreement on that I'm sure I can get
working group agreement that the editor's draft of the entities spec (now)
on github
at

https://w3c.github.io/xml-entities/ (or
http://www.w3.org/2003/entities/2007doc/ which is the same thing)

says the right thing by the time you reference it:-)

The PUBLIC identifier with the wacky // syntax is FPI syntax for SGML
compatibility but XML doesn't insist on FPI syntax there. I wouldn't
personally argue for keeping that (changing anything is mildly inconvenient
but other than that no objection) but I flag it as it's one change in your
list above that may get comment if
I were to try to push an updated editors draft through the W3C review
process.

David

If the option chosen is to keep no-space for HTML5 then I would prefer XHTML as well as the XML entity specification to match that behavior too. I'm not sure I understood whether @davidcarlisle is happy with that or whether that would be rejected by the I18n group and he just wants to add a note about the divergence. Maybe the XML Entity spec should just say that it does not respect Charmod-norm for these 4 entity names, until equivalent non-combining characters are available in Unicode (similar to what the MathML spec says). IIUC, using the XHTML version of the entities is only possible in Gecko so it will be easy to fix browser implementations...

On 8 January 2016 at 13:41, Frédéric Wang notifications@github.com wrote:

If the option chosen is to keep no-space for HTML5 then I would prefer
XHTML as well as the XML entity specification to match that behavior too.
I'm not sure I understood whether @davidcarlisle
https://github.com/davidcarlisle is happy with that or whether that
would be rejected by the I18n group and he just wants to add a note about
the divergence. Maybe the XML Entity spec should just say that it does not
respect Charmod-norm for these 4 entity names, until equivalent
non-combining characters are available in Unicode (similar to what the
MathML spec says). IIUC, using the XHTML version of the entities is
only possible in Gecko so it will be easy to fix browser implementation...

I am sure that's what Anne meant, that HTML, XHTML, MathML, the HTML spec
and the xml-entities spec all end up with no spaces here.

Maybe the XML Entity spec should just say that it does not respect
Charmod-norm for these 4 entity names,

Yes whatever the final outcome of this is I should definitely add some
text to the entities spec clarifying the situation. Sorry I must accept
some responsibility for the situation not being clear in the first place
(although it wasn't all my fault:-)

David

@davidcarlisle do you have suggested wording for point 3 above? Might be best as a new issue against HTML: https://github.com/whatwg/html/issues/new.

On 8 January 2016 at 13:48, Anne van Kesteren notifications@github.com
wrote:

@davidcarlisle https://github.com/davidcarlisle do you have suggested
wording for point 3 above? Might be best as a new issue against HTML:
https://github.com/whatwg/html/issues/new.

point 3 being referring to entities spec....

To be honest I'm not that bothered (It would be nice if it did, but I don't
feel I should tell people how to be nice:-) You could for example not add
any words at all (it's not really that relevant to anyone but me and
Patrick who did the work of trying to match names to numbers for so many
years as Unicode gradually added the characters needed) but perhaps just
add the entities spec to the HTML references list?

David

I updated the editor's draft of xml-entities spec so the four cases of entities using a combining character are not guarded by a space.

Extra section detailing this:

https://w3c.github.io/xml-entities/#chars_math-combining-tables

and diff of the main entity definition file

w3c/xml-entities@59df3c0#diff-91aa5c243fe976a2a456dc12cd885c5d

Note this is an editor's draft and it has not yet been approved by anyone (or even by all the editors), but I thought I'd add a notification here. Comments are welcome here or on the xml-entities commit.

David

@domenic tested #42 (comment) above in Edge, so I have updated the comment. (Edge appears to not support the entity at all in XHTML.)

Given https://bugzilla.mozilla.org/show_bug.cgi?id=1223829 and Edge not doing much that is sensible, three out of four of the points in #42 (comment) are done. Adding a new public identifier should be done through a new issue against https://github.com/whatwg/html.

@davidcarlisle is taking care of upstreaming this change (though due to the way our tools work there is no issue either way downstream).

I think we should close this.

Closing, to be continued in whatwg/html#500 if we can find implementer interest.