Invalid parsing of whitespace at document end

Question

ScumCoder opened this issue 5 years ago · 2 comments

When parsing a trivial document, the GumboStringPiece that holds the original_text of the GumboText for the trailing GUMBO_NODE_WHITESPACE node has an incorrect length value, so the piece also includes the closing tags.

Also, the text field contains two line breaks instead of one.

See SSCCE here.

The version used is aa91b27.
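
The reported symptom can be checked mechanically: a whitespace node's original_text should contain nothing but whitespace, so a piece that runs into the closing tags fails the check. Below is a minimal sketch of such a check. It does not link against Gumbo; StringPiece is a hypothetical stand-in with the same shape as GumboStringPiece (a data pointer plus a length), and piece_is_all_whitespace is a helper name invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in with the same shape as GumboStringPiece:
 * a pointer into the source buffer plus a byte length. */
typedef struct {
    const char *data;
    size_t length;
} StringPiece;

/* Returns true when every byte in the piece is HTML whitespace
 * (space, tab, LF, FF, CR). An empty piece counts as whitespace. */
static bool piece_is_all_whitespace(StringPiece piece) {
    /* A non-empty piece must point at real data. */
    assert(piece.data != NULL || piece.length == 0);
    for (size_t i = 0; i < piece.length; i++) {
        char c = piece.data[i];
        if (c != ' ' && c != '\t' && c != '\n' && c != '\f' && c != '\r') {
            return false;
        }
    }
    return true;
}
```

With a real parse tree, the data and length would be taken from node->v.text.original_text of a GUMBO_NODE_WHITESPACE node. A piece whose length is too large, as described above, would make the check return false.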

Answer 1 · 2019-07-09T21:54:49.000Z

Come to think of it, the handling of the preceding whitespace looks suspicious as well.
A document like this

<!DOCTYPE html>
<html>
<head>
</head>
<body>
</body>
</html>

should produce a root HTML node with five children, not three:

WHITESPACE
HEAD
WHITESPACE
BODY
WHITESPACE

with each whitespace node consisting of a single newline character.

Answer 2 · 2020-02-06T08:45:46.000Z

> Come to think of it, the handling of the preceding whitespace looks suspicious as well.
> A document like this
>
> <!DOCTYPE html>
> <html>
> <head>
> </head>
> <body>
> </body>
> </html>
>
> should produce a root HTML node with five children, not three:
>
> WHITESPACE
> HEAD
> WHITESPACE
> BODY
> WHITESPACE
>
> with each whitespace node consisting of a single newline character.

If you load that document into Chromium and run document.documentElement.childNodes.length in the console, the result is 3. Firefox gives the same result.

So without consulting the spec, I'm inclined to think Gumbo is doing what it's supposed to do.