PuerkitoBio/goquery

doc.Find("body").Contents().Length() is greater than zero even though <body> is empty


Hello,

The following program:

package main

import (
        "bytes"
        "fmt"
        "github.com/PuerkitoBio/goquery"
)

func numNodes(code string) {
        r := bytes.NewReader([]byte(code))

        doc, err := goquery.NewDocumentFromReader(r)
        if err != nil {
                panic(err)
        }

        fmt.Println("Length of selection =", doc.Find("body").Contents().Length())
}

func main() {
        code1 := "<html><head></head><body></body></html>"
        numNodes(code1)

        code2 := "<html><head></head><body></body>\n</html>"
        numNodes(code2)
}

produces this output:

Length of selection = 0
Length of selection = 1

Is it normal to get "Length of selection = 1" for the HTML code that is contained in the code2 variable? I expect to get "Length of selection = 0" because the <body> tag is empty.

mna commented

Hello,

Yes, this is expected. It comes from how the HTML5 parser handles this markup: whitespace that appears after the closing </body> tag is placed inside <body>, so the body is no longer empty. When HTML behaves unexpectedly, it's a good idea to print the document after goquery has parsed it, to see what goquery actually sees, e.g.:

	doc, err := goquery.NewDocumentFromReader(r)
	if err != nil {
		panic(err)
	}
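	// OuterHtml returns (string, error); check the error before printing.
	html, err := goquery.OuterHtml(doc.Selection)
	if err != nil {
		panic(err)
	}
	fmt.Println(html)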

You can see that in your second call, the actual document looks like this:

<html><head></head><body>
</body></html>

And since the Contents method selects not only elements, but also comments and text nodes, it has selected the text node containing the newline.

Hope this helps,
Martin

Thank you very much for your explanation, mna!