google/gumbo-parser

Invalid parsing of whitespace at document end

ScumCoder opened this issue · 2 comments

When parsing a trivial document, the GumboStringPiece that holds the original_text of the GumboText for the GUMBO_NODE_WHITESPACE node has an incorrect length value, so it also covers the closing tags.

Also, the text field contains two linebreaks instead of one.

See SSCCE here.

The version used is aa91b27.
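A quick way to confirm the length problem without eyeballing debugger output is to measure how much of the piece really is whitespace. The sketch below is only an illustration: `StringPiece` is a hypothetical stand-in that mirrors the `data`/`length` pair of GumboStringPiece, and the helper names are made up for this example; none of it is part of Gumbo.

```c
#include <stddef.h>

/* Hypothetical stand-in for GumboStringPiece: a pointer into the source
   buffer plus a byte count. The data is NOT NUL-terminated at `length`. */
typedef struct {
    const char *data;
    size_t length;
} StringPiece;

/* ASCII whitespace as HTML defines it: tab, LF, FF, CR, space. */
int is_html_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Number of leading whitespace bytes in the piece. A result smaller than
   piece.length means the piece runs on into non-whitespace markup. */
size_t whitespace_prefix_length(StringPiece piece) {
    size_t i = 0;
    while (i < piece.length && is_html_space(piece.data[i])) {
        i++;
    }
    return i;
}

/* Returns 1 when every byte of the piece is whitespace, 0 otherwise. */
int is_pure_whitespace(StringPiece piece) {
    return whitespace_prefix_length(piece) == piece.length;
}
```

Copying the `data` and `length` of a whitespace node's original_text into such a struct and calling `is_pure_whitespace` on it should return 0 for the case described above, since the piece would extend into `</body>`.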

Come to think of it, there is something fishy about the earlier whitespace as well.
A document that looks like this

<!DOCTYPE html>
<html>
<head>
</head>
<body>
</body>
</html>

should produce a root HTML node with five children, not three:

  1. WHITESPACE
  2. HEAD
  3. WHITESPACE
  4. BODY
  5. WHITESPACE

each whitespace consisting of a single newline character.

If you load that document into Chromium and run document.documentElement.childNodes.length in the console, it gives a result of 3. Likewise for Firefox.

So without consulting the spec, I'm inclined to think Gumbo is doing what it's supposed to do.