PuerkitoBio/goquery

Can't get children, where plain text is before child

jtagcat opened this issue · 3 comments

gocolly/colly#716

Gonna go sleep, couldn't find duplicates within 1min, there might be.

mna commented

Hello,

This program (taken from what you said was a failing example in the link you provided) prints baz as expected:

package main

import (
	"fmt"
	"github.com/PuerkitoBio/goquery"
	"log"
	"strings"
)

func main() {
	const data = `<div class="foo">Related: <span class="bar">baz</span></div>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	t := doc.Find(".foo > .bar").Text()
	fmt.Println(t)
}

Please provide a complete runnable program that reproduces the issue (like I did here).

Thanks,
Martin

jtagcat commented

Ah yes, I'm scraping the same 3 pagetypes with iteration count over 10. Meanwhile I dropped down a level in abstractions.

Some wrapper library probably called .Children() expecting to iterate over all children.

	h := `<div class="info">
    <h1>
        <a href="#">junk</a>


        Title text foo bar
    </h1>
	</div>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(h))
	if err != nil {
		panic(err)
	}

	s := doc.Find(".info > h1") // 3 children:
	// FirstChild: Data: "\n        "
	// NextSibling: Data: "a"
	// NextSibling: Data: "\n\n\n        Title text foo bar\n    "
	c := s.Children()     // only 2nd is available
	println(len(c.Nodes)) // 1

	s = doc.Find(".info")        // 1 > 3 children
	s = s.ChildrenFiltered("h1") // 3 children
	c = s.Children()             // 1 child, but PrevSibling and NextSibling are available (.Next(), .Last() don't work)
	println(len(c.Nodes))        // 1

mna commented

Yeah, you're mixing up the raw HTML nodes and the goquery selections; those may differ, and that's normal behaviour. The Children family of functions returns a selection of matching elements (i.e. <div>, <h1>, etc.) and does not select non-element nodes like raw text (documented here: https://pkg.go.dev/github.com/PuerkitoBio/goquery#Selection.Children - "gets the child elements of each element"). What you seem to want is the Contents family of functions, which also selects text and comment nodes: https://pkg.go.dev/github.com/PuerkitoBio/goquery#Selection.Contents.

Indeed, even if there are Prev/NextSibling on the selected node, it doesn't mean Next() or Last() will select them, as they work on elements and not raw text or comment nodes.

Hope this helps,
Martin