google/gumbo-parser

gumbo and the spec itself do not detect/fix incorrect use of span tags - should they?

kevinhendricks opened this issue · 2 comments

Please forgive me for my ignorance here but ...

According to the HTML5 spec, a span element's permitted content is "phrasing content", which does not include the "p" element.

Yet gumbo will happily parse the following with no changes made to the node tree and no errors reported.

<html>
<head>
<title></title>
</head>
<body>
<div>
<span class="blah">
<p>this is paragraph 1</p>
<p>this is paragraph 2</p>
</span>
</div>
</body>
</html>

I then checked https://html.spec.whatwg.org/ and http://www.w3.org/TR/html51/syntax.html#syntax, and there is basically no mention of span at all in the parsing algorithm (except one case, when parsing in foreign content).

What is the reason that there are specific rules for what span can contain that are being ignored by the parsing spec? Should it be ignoring them? What does this mean for web browsers? Are they allowing spans to be everywhere? Would/should the css class="blah" rules be applied to both paragraph 1 and 2 in the example above?

This is the distinction between authoring conformance requirements and user agent conformance requirements: in short, authors are required to write a subset of what user agents can process. This is especially obvious when it comes to the schema, defined in terms of "permitted contents": it's very easy to construct a tree that the parser accepts without errors but that violates the schema. Why doesn't the parser check? In the face of foster parenting and the adoption agency algorithm, it's hard to see how the parser could reasonably check the schema, and it was decided long ago to simply not try.
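To make the parse/validate split concrete, here is a minimal sketch in Python. It uses the stdlib html.parser (not gumbo, and not the full HTML5 tree-construction algorithm) as the lenient parsing stage, and layers a tiny, hypothetical content-model check on top: the parser swallows the document without complaint, while the separate pass flags the two p-inside-span violations.

```python
from html.parser import HTMLParser

# Hypothetical, minimal content-model check: "span" permits only
# phrasing content, so a "p" start tag inside an open "span" is an
# authoring error. The lenient parser itself never complains.
PHRASING_ONLY = {"span"}
NON_PHRASING = {"p", "div", "table"}

class ContentModelChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.violations = []  # schema errors found in a separate pass

    def handle_starttag(self, tag, attrs):
        if tag in NON_PHRASING and any(t in PHRASING_ONLY for t in self.stack):
            self.violations.append(f"<{tag}> inside <span>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass

checker = ContentModelChecker()
checker.feed('<div><span class="blah"><p>one</p><p>two</p></span></div>')
print(checker.violations)  # two violations, though parsing raised nothing
```

This is only an illustration of the layering, not of how a real validator works: a real checker needs the full content-model tables from the spec, and a real HTML5 parser would also rearrange some trees (foster parenting) before any check could run.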

What is the reason that there are specific rules for what span can contain that are being ignored by the parsing spec?

The parser makes no attempt to constrain trees to what a conforming document can contain (i.e., the parser doesn't try to impose the schema on all documents).

Should it be ignoring them?

No.

What does this mean for web browsers?

Not very much. It's just another possible DOM. (And in web browsers any DOM is possible through scripting.)

Are they allowing spans to be everywhere?

Not quite everywhere: <table><span>foo</span></table> is an obvious example, as the span gets foster-parented (the parser moves it out of the table, so in the resulting DOM it ends up as a sibling immediately before the table).

Would/should the css class="blah" rules be applied to both paragraph 1 and 2 in the example above?

Yes, for properties that are inherited (in the CSS sense): a color or font-family set on .blah would apply to both paragraphs, whereas a non-inherited property such as border would not.

This is an instance of the distinction between "parsing" and "validation". The spec has a notion of a "valid HTML document" which conforms to all the rules of HTML: this includes tokenizing and parsing without errors, but it also includes the element containment rules in the HTML DTD, which is what you're referring to.

The HTML5 spec also, by design, allows for invalid HTML. One of its goals is to produce a DOM for every potential string of input characters. The DOM might not validate under the rules of the DTD, but it will parse and give you something sensible that can be displayed, styled, and manipulated with JS.

If you're familiar with compiler design, think of tokenization and parsing as producing lexical and parse errors, respectively. And then think of DTD validation as typechecking. It's performed in a separate pass, and in some languages is optional.
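Python itself offers a handy stdlib illustration of that analogy: ast.parse happily accepts a program that a type checker (a separate, optional pass, e.g. mypy) would reject, just as the HTML5 parser accepts a tree the validator would reject.

```python
import ast

# A type-incorrect program: annotated as int, assigned a str.
source = "x: int = 'hello'"

# Lexing and parsing succeed -- no error is raised.
tree = ast.parse(source)
print(type(tree.body[0]).__name__)  # AnnAssign

# Type checking (e.g. running mypy over this source) is a separate,
# optional pass -- just as schema validation of parsed HTML is.
```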

As for web browsers - recent ones will parse according to the HTML5 parsing algorithm. Yes, this means that they will sometimes produce HTML that doesn't validate against the DTD, like paragraphs inside a span. Sometimes, browsers behave in strange ways when they encounter invalid HTML - for example, a Google Doodle I worked on broke the homepage in IE because it stuck a div inside an anchor inside a span. These are problems that occur at the rendering & event model layers, however, not parsing.

In practice, almost no major website validates. Take a look at the validator reports for Google, Reddit, and GitHub. Oftentimes these errors are deliberate; e.g., at Google a lot of our validation errors were there because the "right way" added extra latency to the SRP (search results page), and the practical benefit of serving people their content faster outweighed conformance to a spec when all major browsers functioned properly anyway.