whatwg/html

Clarify using CDATA in HTML context

Opened this issue · 12 comments

What is the issue with the HTML Standard?

Currently, the HTML standard isn't clear on whether a <![CDATA[ ]]> section may be used in an HTML context:

https://html.spec.whatwg.org/multipage/syntax.html#cdata-sections

There is just an example that – speaking only for the example itself – claims "CDATA sections can only be used in foreign content (MathML or SVG)."

Is this statement true for HTML? Then it should be moved outside the example heading.

Per spec it's a conformance error; see https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state.
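For readers who don't want to follow the link, here is a minimal Python sketch of the decision that tokenizer state makes after consuming `<!`. The function name and the `in_foreign_content` flag are hypothetical; the flag stands in for the spec's check that there is an adjusted current node and that it is not an element in the HTML namespace.

```python
from typing import Optional, Tuple


def markup_declaration_open(rest: str, in_foreign_content: bool) -> Tuple[str, Optional[str]]:
    """Return (next tokenizer state, parse error or None) for the input after '<!'.

    Simplified sketch: `in_foreign_content` is an assumed stand-in for the
    spec's "adjusted current node is not in the HTML namespace" check.
    """
    if rest.startswith("--"):
        return ("comment start", None)
    head = rest[:7]
    # "DOCTYPE" is matched ASCII case-insensitively.
    if head.isascii() and head.upper() == "DOCTYPE":
        return ("DOCTYPE", None)
    # "[CDATA[" is matched case-sensitively.
    if rest.startswith("[CDATA["):
        if in_foreign_content:
            return ("CDATA section", None)
        return ("bogus comment", "cdata-in-html-content")
    return ("bogus comment", "incorrectly-opened-comment")
```

So the same `[CDATA[` characters lead to the CDATA section state inside SVG or MathML, and to a parse error plus the bogus comment state everywhere else.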

I agree that this should probably be stated somewhere near the section you linked, instead of just inside the parser. I'm not really sure what the best conventions are for this sort of duplicate conformance requirement, but I know we have a variety of them.

I guess this is already implicit in https://html.spec.whatwg.org/#elements-2 actually?

The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which again, might be implied in certain cases). The exact allowed contents of each individual element depend on the content model of that element, as described earlier in this specification.

and no content models allow CDATA sections.

But yeah, your idea of just moving this sentence outside of the example might be reasonable.

I think it could be a note instead of being part of the example, but we probably wouldn't want to restate it normatively as it indeed already follows from where it is referenced?

@annevk, I thought you were the one who generally argued for the duplicate-normative-conformance-requirements approach, per whatwg/url#704 (comment) ?

I'm okay with separate requirements for "parsing" and "writing", but here we are talking about duplicating a "writing" requirement, no?

Only foreign elements are defined as allowing CDATA sections:

Foreign elements whose start tag is marked as self-closing can't have any contents (since, again, as there's no end tag, no content can be put between the start tag and the end tag). Foreign elements whose start tag is not marked as self-closing can have text, character references, CDATA sections, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.

I guess it depends on whether you find it clear that "foreign elements can have CDATA sections" implies "non-foreign elements cannot have CDATA sections". I think that's probably technically how it is written, but kind of confusing.

The writing section is kind of written that way. It starts with a document and goes downward from there.

From my perspective, a statement like "foreign elements can have CDATA sections" is neither exhaustive nor exclusive enough. It's similar to saying "1 > 0". That wouldn't exclude 2 from also being greater than 0.

Sure, but coupled with the next paragraph (and other text in that section) it's quite clear though:

Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
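To make the comparison concrete, the two quoted enumerations can be written as sets. This is only an illustrative sketch: the constant and function names are hypothetical, and the foreign set covers foreign elements whose start tag is not marked as self-closing.

```python
# Content kinds listed in the two quoted paragraphs of the writing rules.
FOREIGN_ELEMENT_CONTENT = frozenset(
    {"text", "character references", "CDATA sections", "other elements", "comments"}
)
NORMAL_ELEMENT_CONTENT = frozenset(
    {"text", "character references", "other elements", "comments"}
)


def may_contain(element_kind: str, content_kind: str) -> bool:
    """Check a content kind against the enumeration for 'foreign' or 'normal' elements."""
    allowed = {
        "foreign": FOREIGN_ELEMENT_CONTENT,
        "normal": NORMAL_ELEMENT_CONTENT,
    }[element_kind]
    return content_kind in allowed


# The only content kind foreign elements gain over normal elements.
ONLY_IN_FOREIGN = FOREIGN_ELEMENT_CONTENT - NORMAL_ELEMENT_CONTENT
```

Here `ONLY_IN_FOREIGN` contains just "CDATA sections". Read as exhaustive lists, the two paragraphs exclude CDATA sections from normal elements, although nothing in either paragraph says the lists are exhaustive, which is the point of contention above.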

I cannot find the text you are referring to in the above mentioned document near the CDATA sections section.

I agree it’s worth adding some clarification in the CDATA sections section itself, and I think we could do that with just this:

-  <p><dfn data-x="syntax-cdata">CDATA sections</dfn> must consist of the following components, in
-  this order:</p>
+  <p><dfn data-x="syntax-cdata">CDATA sections</dfn> can only be used in foreign content (MathML
+  or SVG), and must consist of the following components, in this order:</p>

I don’t think it’s necessary to normatively restate the requirement anywhere; instead just that “can” there is sufficient — given that the actual normative document-conformance requirements are stated in the places Anne cited.

(And for the record here: the normativity follows from the fact that the only place where the spec references the “CDATA sections” definition is in the enumeration of what foreign elements are limited to consisting of — which explicitly includes CDATA sections; while the corresponding enumeration of what normal elements are limited to consisting of explicitly does not include CDATA sections — so the spec already states that CDATA sections are explicitly not allowed in normal elements.)

When writing WordPress’ HTML parser I found the terminology confusing and think there’s room to improve the communication around CDATA sections. Specifically, I find it’s confusing for a human looking at the syntax.

<![CDATA[]]>

What is this fragment of syntax? I presume most people will look at that and say, “it’s a CDATA section.” The HTML specification, however, must know the context around the fragment and will either say, “it’s a CDATA section,” or more likely, “it’s a syntax error that creates a bogus comment.”
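A rough Python sketch of the two outcomes may help here. It is a simplification, not a tokenizer: `tokenize_cdata_like` and `in_foreign_content` are hypothetical names, the input must start with `<![CDATA[`, and anything after the construct is assumed to be plain text with no further markup.

```python
from typing import List, Tuple

CDATA_OPEN = "<![CDATA["


def tokenize_cdata_like(source: str, in_foreign_content: bool) -> List[Tuple[str, str]]:
    """Return simplified (kind, data) tokens for input starting with '<![CDATA['."""
    if not source.startswith(CDATA_OPEN):
        raise ValueError("this sketch only handles input starting with '<![CDATA['")
    tokens: List[Tuple[str, str]] = []
    if in_foreign_content:
        # CDATA section: everything up to the first ']]>' is character data.
        body = source[len(CDATA_OPEN):]
        end = body.find("]]>")
        data, rest = (body, "") if end == -1 else (body[:end], body[end + 3:])
        if data:
            tokens.append(("text", data))
    else:
        # Bogus comment: data starts at '[CDATA[' and stops at the first '>'.
        body = source[2:]
        end = body.find(">")
        data, rest = (body, "") if end == -1 else (body[:end], body[end + 1:])
        tokens.append(("comment", data))
    if rest:
        # Assumption: the remaining input contains no markup, so it is all text.
        tokens.append(("text", rest))
    return tokens
```

With this sketch, `tokenize_cdata_like("<![CDATA[a > b]]>", True)` yields a single text token `a > b`, while the same input with `False` yields a comment with data `[CDATA[a ` followed by the text ` b]]>`, because the bogus comment ends at the first `>` rather than at `]]>`.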

I know in discussions with others this has been hard to communicate, as we often ask the question, “What should happen when a CDATA section appears within HTML elements?” The cheap answer is how I feel the specification words it: this can’t happen - the question is invalid.

So it might be clarifying to expand on this where CDATA is first mentioned:

CDATA sections can only be used in foreign content (MathML or SVG).
Everywhere else that they appear to exist is considered invalid HTML and the token transforms into [a bogus comment](link to tokenizing step handling this).