HTML doesn't round-trip correctly with uppercase ASCII tag names.

Question

sirrobert opened this issue 3 years ago · 2 comments

sirrobert commented 3 years ago

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

1.0.1

Link to runnable example

No response

Steps to reproduce

Here's a simple script to reproduce:

import { fromHtml } from 'hast-util-from-html'
import { toHtml }   from 'hast-util-to-html'

let tagName="H1";

// Represents html: <H1>Title</H1>
const ast = {"type":"root","children":[{"type":"element","tagName":tagName,"properties":{},"children":[{"type":"text","value":"Title","position":{"start":{"line":1,"column":5,"offset":4},"end":{"line":1,"column":10,"offset":9}}}],"position":{"start":{"line":1,"column":1,"offset":0},"end":{"line":1,"column":15,"offset":14}}}],"data":{"quirksMode":false},"position":{"start":{"line":1,"column":1,"offset":0},"end":{"line":1,"column":15,"offset":14}}};                                                                  
                                                                                  
console.log(JSON.stringify(ast));
console.log(toHtml(ast));
console.log(JSON.stringify(fromHtml(toHtml(ast), {fragment:true})));

Output:

$ node ./demo.js
{"type":"root","children":[{"type":"element","tagName":"H1","properties":{},"children":[{"type":"text","value":"Title","position":{"start":{"line":1,"column":5,"offset":4},"end":{"line":1,"column":10,"offset":9}}}],"position":{"start":{"line":1,"column":1,"offset":0},"end":{"line":1,"column":15,"offset":14}}}],"data":{"quirksMode":false},"position":{"start":{"line":1,"column":1,"offset":0},"end":{"line":1,"column":15,"offset":14}}}
<H1>Title</H1>
{"type":"root","children":[{"type":"element","tagName":"h1","properties":{},"children":[{"type":"text","value":"Title","position":{"start":{"line":1,"column":5,"offset":4},"end":{"line":1,"column":10,"offset":9}}}],"position":{"start":{"line":1,"column":1,"offset":0},"end":{"line":1,"column":15,"offset":14}}}],"data":{"quirksMode":false},"position":{"start":{"line":1,"column":1,"offset":0},"end":{"line":1,"column":15,"offset":14}}}

Notice that the tag names differ in case between the first and last lines of output. The serializer (toHtml()) keeps the uppercase tag name from the AST, but the parser (fromHtml()) lowercases it.

Expected behavior

I would expect that:

a round trip of fromHtml(toHtml(ast)) can be relied upon to produce the same AST, and
any HTML that complies with the WHATWG spec would be respected in this way, such as when the HTML uses "a mix of lower- and uppercase letters" in a tag name.

This expectation is because the two specifications linked in the project readme say the following.

According to the HTML spec (https://html.spec.whatwg.org/dev/syntax.html#syntax-tag-name):

Tags contain a tag name, giving the element's name. HTML elements all have names that only use ASCII alphanumerics. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive. (emphasis added)

And according to the unist spec,

This means that the syntax tree should be able to be converted to and from JSON and produce the same tree. For example, in JavaScript, a tree can be passed through JSON.parse(JSON.stringify(tree)) and result in the same tree.
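The unist guarantee quoted above concerns JSON serialization, which can be checked directly. Here is a minimal sketch using a trimmed-down version of the tree from the reproduction script (positions are omitted for brevity, and no hast utilities are involved):

```javascript
// A hand-built hast-like tree with an uppercase tag name (positions omitted).
const tree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'H1',
      properties: {},
      children: [{type: 'text', value: 'Title'}]
    }
  ]
}

// Round-trip through JSON, as described in the unist spec.
const copy = JSON.parse(JSON.stringify(tree))

// The JSON round trip keeps the tag name exactly as written.
console.log(copy.children[0].tagName) // H1
console.log(JSON.stringify(copy) === JSON.stringify(tree)) // true
```

This only exercises the JSON round trip; it says nothing about what happens when the tree is serialized to HTML and parsed again.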

Actual behavior

The fromHtml() function appears to coerce tag names to lowercase, which is only a subset of what the spec allows. As a result, a round trip produces a different result in some cases.

I propose that fromHtml() preserve tag name case to comply with the specifications, and that an option along the lines of {lowercaseTags:true} be provided to support the current behavior of normalizing HTML tag names to lowercase (which is the generally preferred industry norm).

Further, to comply with the spec, searches should coerce tag names to lowercase for the search operation only.
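As an illustration of what a case-insensitive search could look like, here is a minimal sketch over a plain hast-like object. The helper name findByTagName is hypothetical and is not part of any hast utility; it lowercases tag names for the comparison only and leaves the tree untouched.

```javascript
// Hypothetical helper: collect elements whose tag name matches `name`,
// ignoring letter case. The tree itself is not modified.
function findByTagName(node, name, found = []) {
  const wanted = name.toLowerCase()
  if (node.type === 'element' && node.tagName.toLowerCase() === wanted) {
    found.push(node)
  }
  // Text nodes have no children, so fall back to an empty list.
  for (const child of node.children || []) {
    findByTagName(child, name, found)
  }
  return found
}

// A small tree mixing an uppercase and a lowercase heading.
const searchTree = {
  type: 'root',
  children: [
    {type: 'element', tagName: 'H1', properties: {}, children: [{type: 'text', value: 'One'}]},
    {
      type: 'element',
      tagName: 'div',
      properties: {},
      children: [
        {type: 'element', tagName: 'h1', properties: {}, children: [{type: 'text', value: 'Two'}]}
      ]
    }
  ]
}

const matches = findByTagName(searchTree, 'h1')
console.log(matches.map((node) => node.tagName)) // [ 'H1', 'h1' ]
```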

Affected runtime and version

node@18.15.0, hast-util-from-html@1.0.1

Affected package manager and version

npm@9.5.0

Affected OS and version

Ubuntu 21.10

Build and bundle tools

No response

Answer 1 · 2023-04-09T12:23:46.000Z

Hey!

HTML is lossy: not every AST can be serialized in a way that an HTML parser would parse back to that same AST. This is documented in the HTML spec; see the notes in 13.3: https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments.

What is possible is to go string -> ast (a) -> string -> ast (b), where a and b are equivalent.
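If a hand-built tree needs to be compared against parser output, one option is to normalize it first. The sketch below assumes plain HTML elements only (no SVG or MathML, whose names may legitimately contain uppercase letters) and uses a hypothetical helper, lowercaseTagNames, that is not part of any hast utility. The two trees mirror the ones in the report, with positions omitted.

```javascript
// Hypothetical helper: return a copy of a hast-like tree in which every
// element's tag name is lowercased. The input tree is left unchanged.
function lowercaseTagNames(node) {
  const copy = {...node}
  if (copy.type === 'element') {
    copy.tagName = copy.tagName.toLowerCase()
  }
  if (Array.isArray(copy.children)) {
    copy.children = copy.children.map((child) => lowercaseTagNames(child))
  }
  return copy
}

// The hand-built tree from the report (positions omitted).
const handBuilt = {
  type: 'root',
  children: [
    {type: 'element', tagName: 'H1', properties: {}, children: [{type: 'text', value: 'Title'}]}
  ]
}

// The shape the report shows coming back from the parser (positions omitted).
const parsed = {
  type: 'root',
  children: [
    {type: 'element', tagName: 'h1', properties: {}, children: [{type: 'text', value: 'Title'}]}
  ]
}

const normalized = lowercaseTagNames(handBuilt)
console.log(JSON.stringify(normalized) === JSON.stringify(parsed)) // true
console.log(handBuilt.children[0].tagName) // H1
```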

How elements are parsed and serialized is defined in the HTML spec.

document.body.innerHTML = '<H1>asd</H1>'; document.body.innerHTML // <h1>asd</h1>

Furthermore, ASTs are by definition lossy: they are abstract, not concrete.

There is perhaps another question behind the solution you are asking about.
Presumably, you are not parsing HTML, but you are parsing some custom XML-like language.
If you deal with XML, use https://github.com/syntax-tree/xast-util-from-xml.
If you deal with a different language, you need a parser for that language!
If you do deal with HTML, this current behavior should not matter.
