html5lib/html5lib-tests

Incorrect expectations in tree-construction webkit02 tests

iabudiab opened this issue · 25 comments

Correct if I'm wrong. I could prepare a PR, but wanted to verify this first.

I believe the expectations for the following tests (15, 16, 17 in webkit02) to be incorrect:
<b><em><foo><foo><foo><aside></b></em>
<b><em><foo><foo><foo><foo><foo><foo><foo><foo><foo><foo><aside></b></em>
<b><em><foo><foob><foob><foob><foob><fooc><fooc><fooc><fooc><food><aside></b></em>

This <b><em><foo><foo><foo><aside></b></em> expects the following tree:

| <html>
|   <head>
|   <body>
|     <b>
|       <em>
|         <foo>
|           <foo>
|             <foo>
|     <aside>
|       <b>

but should expect:

| <html>
|   <head>
|   <body>
|     <b>
|       <em>
|         <foo>
|           <foo>
|             <foo>
|     <aside>
|       <b>
|         <em>

The same is true for the other two tests, i.e. the <b> element, child of <aside>, should have an <em> after running the adoption agency algorithm.

Oh. Those tests. I remember them taking me quite a while to convince myself of their correctness. It looks like they're checking the interaction between the Noah's Ark clause (does <aside> introduce a marker?) and the AAA. I'm soon boarding a far-too-many-hours flight, hopefully I'll have convinced myself of their correctness or not by the end. :)

The em is removed because of "If inner loop counter is greater than three and node is in the list of active formatting elements, then remove node from the list of active formatting elements" combined with "If node is not in the list of active formatting elements, then remove node from the stack of open elements and then go back to the step labeled inner loop", no? (When the inner loop counter is 1, the bottommost foo is removed from the stack; when 2 the middle foo; when 3 the top foo.)

We have interop, and I'm pretty sure that's the key to working out what's going on.

I've doen some calculations myself.

Here is is the state when first entering the AAA with </b> token:

Stack of Open Elements Active Formatting Elements
html, body, b, em, foo, foo, foo, aside b, em
  • In the outer loop we get the following:
Formatting Element Furthest Block Common Ancestor Bookmark
b aside body 0
  • Now the inner loop.

Blink and WebKit limit the inner loop counter to 3. This is also true for the html5lib-python implementation. However, the current spec doesn't state this explicitly, at least not explicitly enough as it is the case for the outer loop Outer loop: If outer loop counter is greater than or equal to eight, then abort these steps.

So in this case the <em> wouldn't be removed via the following step If inner loop counter is greater than three and node is in the list of active formatting elements, then remove node from the list of active formatting elements

So the state directly after the inner loop is as follows:

Stack of Open Elements Active Formatting Elements
html, body, b, em, aside b, em

and the DOM tree is untouched.

  • Now the furthest block <aside> gets inserted into the common ancestor <body>
  • A new element for the token, i.e. <b>, is appended to <aside>
  • <b> is removed from the active formatting elements and reinserted at bookmark, which is 0, so this stays as it is
  • <b> is removed from the stack of open elements and is reinserted after the <aside>
  • The second iteration of the outer loops breaks because there is no furthest block and we end up with:
Stack of Open Elements Active Formatting Elements
html, body, em, aside em

with the following tree:

| <html>
|   <head>
|   <body>
|     <b>
|       <em>
|         <foo>
|           <foo>
|             <foo>
|     <aside>
|       <b>
  • The outer loop of the AAA for the </em> token is almost the same:
Formatting Element Furthest Block Common Ancestor Bookmark
em aside body 0
  • The inner loop however is left immediately, since the element directly above the <aside> in the stack of pen elements is the formatting element.
  • A new element for the token <em> is created and it gets all child nodes from <aside>, as described: Take all of the child nodes of furthest block and append them to the element created in the last step.
  • The <em> is then appended to <aside>
  • The second iteration of the outer loops breaks because there is no furthest block and we end up with:
Stack of Open Elements Active Formatting Elements
html, body, aside

with the following tree:

| <html>
|   <head>
|   <body>
|     <b>
|       <em>
|         <foo>
|           <foo>
|             <foo>
|     <aside>
|       <em>
|         <b>

Notice the ordering of <em> and <b> under <aside>. My initial assumption was wrong, however the test expectation is still incorrect.

I have also tested the HTML snippet in Chrome, Safari & Firefox, all three produce this tree, i.e. aside, em, b.

As I see it, the spec is ambiguous when it comes to the inner loop handling.

Given this is getting into clarity of the spec: cc/ @domenic @annevk

@RReverser or @zcorpan are probably much more able to address this than I am.

@zcorpan or maybe @inikulin are more likely to be able to answer this.

Just re-checked @iabudiab's steps per spec and in parse5 - test's expected value is wrong.

nox commented

I don't see what in @iabudiab's explanation disproves what @gsnedders stated in #78 (comment).

To me by the time the algorithm reaches <em>, the inner loop counter already reached 3 and thus <em> gets removed.

FWIW, html5ever passes this test.

In Gecko I get this tree for http://software.hixie.ch/utilities/js/live-dom-viewer/saved/4199

| <html>
|   <head>
|   <body>
|     <b>
|       <em>
|         <foo>
|           <foo>
|             <foo>
|     <aside>
|       <b>

i.e. no em in the aside. But WebKit/Chromium/IE11 have aside, em, b.

cc @hsivonen

I feel like there was some change to the spec which caused fragmentation in results

From #78 (comment) (@iabudiab)

Blink and WebKit limit the inner loop counter to 3. This is also true for the html5lib-python implementation. However, the current spec doesn't state this explicitly, at least not explicitly enough as it is the case for the outer loop Outer loop: If outer loop counter is greater than or equal to eight, then abort these steps.

So in this case the <em> wouldn't be removed via the following step If inner loop counter is greater than three and node is in the list of active formatting elements, then remove node from the list of active formatting elements

You're saying that it doesn't get reached in Blink and WebKit because they break out of the loop when the inner loop counter is three, and hence that never gets hit? Given the spec doesn't have that breakout from the loop, it should therefore get removed, no?

@inikulin whatwg/html@22ce3c31 is why WebKit/Blink/IE11/html5lib-python has that behaviour

As far as I can tell, @iabudiab's proposed behaviour is what would be expected by the spec prior to whatwg/html@22ce3c31. (And hence why it matches WebKit/Blink/IE11/html5lib-python.)

@gsnedders same in parse5

Edge matches IE11, FWIW.

So is there agreement that that was a good spec change, and we should be filing bugs on WebKit/Blink/Edge/html5lib-python/parse5?

Or do we want to align with the majority, revert the spec change, and file bugs on Gecko/html5ever?

It would be nice to know motivation behind whatwg/html@22ce3c3

@domenic html5lib-python just has a generic "update to match the spec" issue (it hasn't been properly updated in a long time) :)

@inikulin https://lists.w3.org/Archives/Public/public-whatwg-archive/2013Jul/0401.html

Notably, <b><i><a><s><tt><div></b>abc</b></div></tt></s></a>xyz</i> gives a non-sensical parse-tree (where xyz precedes abc) in the old spec/WebKit/Blink/Edge.

Notably,<b><i><a><s><tt><div></b>abc</b></div></tt></s></a>xyz</i> gives a non-sensical parse-tree (where xyz precedes abc) in the old spec/WebKit/Blink/Edge.

Well, I don't see a problem. With foster parenting you will have similar results: <table><tr><td>123</td></tr><div>456</div></table>. And IMHO such cases are more common than the one with AAA.

Reordering in tables is something that all browsers did pre-HTML5. Reordering for misnested formatting elements is not.

Since it seems Gecko has successfully shipped the spec change, we should go ahead and implement it in the other browsers, too (unless we have data suggesting the new spec is less Web-compatible). The change in Gecko was https://bugzilla.mozilla.org/show_bug.cgi?id=901319

@travisleithead any update on this issue for Edge? re https://lists.w3.org/Archives/Public/public-html/2013Aug/0002.html

So we have consensus that the tests are right v. the spec, and impls should change? Closing as a result. (If that understanding is wrong, feel free to reopen it. Or ping me if you can't.)

Mmm. The 2013 bug is likely gone with our old bug database. I opened an issue on our public tracker--you can monitor progress there: https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/9854392/