Invalid parsing of whitespace at document end
ScumCoder opened this issue · 2 comments
When parsing a trivial document, the GumboStringPiece holding the original_text of the GumboText that describes the GUMBO_NODE_WHITESPACE node has an incorrect length value, which causes it to include the closing tags.
Also, the text field contains two line breaks instead of one.
See SSCCE here.
The version used is aa91b27.
Come to think of it, there is something fishy about the preceding whitespace as well.
A document looking like this
<!DOCTYPE html>
<html>
<head>
</head>
<body>
</body>
</html>
should produce a root HTML node with five children, not three:
- WHITESPACE
- HEAD
- WHITESPACE
- BODY
- WHITESPACE
each whitespace consisting of a single newline character.
If you load that document into Chromium and run document.documentElement.childNodes.length in the console, it gives a result of 3. Likewise for Firefox.
So without consulting the spec, I'm inclined to think Gumbo is doing what it's supposed to do.
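For what it's worth, the browser result looks consistent with the tree-construction rules in the HTML standard as I read them: whitespace is dropped in the "initial", "before html" and "before head" insertion modes, it is inserted in "in head" and "after head", and in "after body" / "after after body" it is handled with the "in body" rules, so it lands inside BODY. Below is a minimal sketch of just that whitespace logic for a bare html/head/body skeleton. It is a toy model, not Gumbo's code; build_tree and append_text are hypothetical names made up for illustration.

```python
import re

# Insertion modes in which whitespace character tokens are ignored.
IGNORING_MODES = {"initial", "before html", "before head"}


def append_text(children, text):
    # Adjacent character tokens coalesce into a single text node.
    if children and children[-1].isspace():
        children[-1] += text
    else:
        children.append(text)


def build_tree(document):
    """Toy model: only handles DOCTYPE, html, head, body and whitespace."""
    tree = {"html": [], "head": [], "body": []}  # element -> child list
    mode = "initial"
    current = None  # element that receives inserted characters
    for token in re.findall(r"<[^>]+>|[^<]+", document):
        tag = token.lower()
        if tag.startswith("<!doctype"):
            mode = "before html"
        elif tag == "<html>":
            mode, current = "before head", "html"
        elif tag == "<head>":
            tree["html"].append("HEAD")
            mode, current = "in head", "head"
        elif tag == "</head>":
            mode, current = "after head", "html"
        elif tag == "<body>":
            tree["html"].append("BODY")
            mode, current = "in body", "body"
        elif tag == "</body>":
            mode = "after body"  # body stays the current node
        elif tag == "</html>":
            mode = "after after body"  # whitespace still uses "in body" rules
        elif token.isspace() and mode not in IGNORING_MODES:
            append_text(tree[current], token)
        # Anything else (non-whitespace text, other tags) is outside this model.
    return tree


DOC = "<!DOCTYPE html>\n<html>\n<head>\n</head>\n<body>\n</body>\n</html>"
tree = build_tree(DOC)
print(len(tree["html"]))  # 3: HEAD, one whitespace text node, BODY
print(repr(tree["body"]))  # ['\n\n']
```

In this model the newline before HEAD is dropped, and the newline after the closing body tag is appended to BODY's existing text node rather than becoming a child of the root. That would give three root children, and it may also be why the whitespace node's text holds two line breaks while its original_text span runs across the closing tag.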