PuerkitoBio/goquery

Question about parsing nested tables and finding outer elements

atye opened this issue · 5 comments

atye commented

I want to parse only the outer tr and td elements of a table and ignore the inner tables. My approach is to find the outer tr elements and then, from that selection, find the td elements. I expect to find 2 tr elements and 3 td elements.

In the example below, I don't understand why the last call, which passes .start > tbody > tr > td to Find on the row selection, finds the 3 outer td elements. Doesn't Find only search descendants? The element with the start class and the tbody element are ancestors of the rows in the selection, right?

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

var data = `
<!DOCTYPE html>
<html>
<body>
    <table class="start">
        <tbody>
            <tr>
                <td>test1</td>
                <td>test2</td>
            </tr>
            <tr>
                <td>
                    <table>
                        <tbody>
                            <tr>
                                <td>test3</td>
                                <td>test4</td>
                            </tr>
                            <tr>
                                <td>test5</td>
                                <td>test6</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>
        </tbody>
    </table>
</body>
</html>
`

func main() {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}

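	// Finds the outer tr elements (prints 2).
	rowSelection := doc.Find(".start > tbody > tr")
	fmt.Println(len(rowSelection.Nodes))

	// Finds every td, including those in the nested table (prints 7).
	colSelection := rowSelection.Find("td")
	fmt.Println(len(colSelection.Nodes))

	// Also finds every td, because every td is a child of some tr (prints 7).
	colSelection = rowSelection.Find("tr > td")
	fmt.Println(len(colSelection.Nodes))

	// Finds no td elements (prints 0).
	colSelection = rowSelection.Find("> td")
	fmt.Println(len(colSelection.Nodes))

	// Finds only the 3 outer td elements, even though .start and tbody
	// are ancestors of the rows, not descendants (prints 3).
	colSelection = rowSelection.Find(".start > tbody > tr > td")
	fmt.Println(len(colSelection.Nodes))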
}

Output:

2
7
7
0
3

mna commented

Hello Aaron,

Thanks for the nice minimal reproduction program, much appreciated. It does look like a bug: I tried the same thing with jQuery and it does not behave this way. I'm trying to figure out whether it's a goquery issue or something in the lower-level cascadia package, which handles the CSS selectors. I'll update this issue with whatever I find.

Thanks,
Martin

mna commented

All right, so after investigation, it is related to cascadia, but I think it might be by design and working as intended, although differently from jQuery (see the linked cascadia issue). I'll keep this open until I hear back from Andy, but that's something I'll document better if confirmed.

atye commented

According to that issue, it seems "the selector is always started from the root of the document, but only descendants of the contextual node are returned (if they do match)" is expected behavior. If there is nothing else, I suppose this can be closed.

mna commented

Yeah, if you don't mind, I'll keep it open as a reminder to address this in goquery's documentation. I haven't had time to get to it yet, but I should be able to do so shortly.

mna commented

All right, I added some notes about this in v1.9.1. Closing now, thanks!