Question about parsing nested tables and finding outer elements
atye opened this issue · 5 comments
I want to parse only the outer tr and td elements of a table and ignore the inner tables. My method is to find the outer tr elements and with that selection, find the td elements. I should find 2 tr elements and 3 td elements.
In the example below, I don't understand why the last method, using .start > tbody > tr > td in the row selection, works to find the 3 outer td elements. Doesn't Find only search descendants? The element with the start class and the tbody element are ancestors of the row selection, right?
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

var data = `
<!DOCTYPE html>
<html>
<body>
<table class="start">
  <tbody>
    <tr>
      <td>test1</td>
      <td>test2</td>
    </tr>
    <tr>
      <td>
        <table>
          <tbody>
            <tr>
              <td>test3</td>
              <td>test4</td>
            </tr>
            <tr>
              <td>test5</td>
              <td>test6</td>
            </tr>
          </tbody>
        </table>
      </td>
    </tr>
  </tbody>
</table>
</body>
</html>
`

func main() {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}

	// find outer tr
	rowSelection := doc.Find(".start > tbody > tr")
	fmt.Println(len(rowSelection.Nodes))

	// finds all td
	colSelection := rowSelection.Find("td")
	fmt.Println(len(colSelection.Nodes))

	// finds all td
	colSelection = rowSelection.Find("tr > td")
	fmt.Println(len(colSelection.Nodes))

	// finds no td
	colSelection = rowSelection.Find("> td")
	fmt.Println(len(colSelection.Nodes))

	// finds outer td
	colSelection = rowSelection.Find(".start > tbody > tr > td")
	fmt.Println(len(colSelection.Nodes))
}
2
7
7
0
3
Hello Aaron,
Thanks for the nice minimal reproduction program, much appreciated. It does look like a bug: I tried the same thing with jQuery and it does not behave this way. I'm trying to figure out if it's a goquery issue or something in the lower-level cascadia package that handles the CSS selectors. I will update this issue with whatever I find.
Thanks,
Martin
All right, so after investigation, it is related to cascadia, but I think it might be by design and working as intended, although differently from jQuery (see the linked cascadia issue). I'll keep this open until I hear back from Andy, but that's something I'll document better if confirmed.
According to that issue, it seems "the selector is always started from the root of the document, but only descendants of the contextual node are returned (if they do match)" is expected behavior. If there is nothing else, I suppose this can be closed.
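To make the quoted behavior concrete, here is a minimal sketch of it using only the standard library. It is a toy model, not cascadia's or goquery's actual code: the node type and the matchesChain, find, buildRows and count helpers are hypothetical names invented for this illustration, and only child-combinator chains of tag or .class selectors are modeled. The point it tries to show is that the selector is checked against a candidate's full ancestor chain (which may extend above the context node), while only descendants of the context node are returned.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for an HTML element.
type node struct {
	tag, class string
	parent     *node
	children   []*node
}

// add appends a new child element to n and returns the child.
func (n *node) add(tag, class string) *node {
	c := &node{tag: tag, class: class, parent: n}
	n.children = append(n.children, c)
	return c
}

// matchesSimple reports whether n matches a simple selector: "tag" or ".class".
func matchesSimple(n *node, sel string) bool {
	if len(sel) > 0 && sel[0] == '.' {
		return n.class == sel[1:]
	}
	return n.tag == sel
}

// matchesChain checks a child-combinator chain such as
// {".start", "tbody", "tr", "td"} against n by walking up its parents.
// The walk is not bounded by any context node, so ancestors above the
// context can take part in the match.
func matchesChain(n *node, chain []string) bool {
	for i := len(chain) - 1; i >= 0; i-- {
		if n == nil || !matchesSimple(n, chain[i]) {
			return false
		}
		n = n.parent
	}
	return true
}

// find returns only the descendants of ctx that match chain.
func find(ctx *node, chain []string) []*node {
	var out []*node
	for _, c := range ctx.children {
		if matchesChain(c, chain) {
			out = append(out, c)
		}
		out = append(out, find(c, chain)...)
	}
	return out
}

// buildRows rebuilds the table structure from the issue and returns
// the two outer rows (the equivalent of rowSelection).
func buildRows() []*node {
	root := &node{tag: "body"}
	tbody := root.add("table", "start").add("tbody", "")
	r1 := tbody.add("tr", "")
	r1.add("td", "")
	r1.add("td", "")
	r2 := tbody.add("tr", "")
	inner := r2.add("td", "").add("table", "").add("tbody", "")
	for i := 0; i < 2; i++ {
		r := inner.add("tr", "")
		r.add("td", "")
		r.add("td", "")
	}
	return []*node{r1, r2}
}

// count sums the matches found under every row in the selection.
func count(rows []*node, chain ...string) int {
	total := 0
	for _, r := range rows {
		total += len(find(r, chain))
	}
	return total
}

func main() {
	rows := buildRows()
	fmt.Println(count(rows, "td"))                          // 7
	fmt.Println(count(rows, "tr", "td"))                    // 7
	fmt.Println(count(rows, ".start", "tbody", "tr", "td")) // 3
}
```

In this model, "tr > td" matches all 7 cells because each outer td's parent tr is the context row itself, which the upward walk is still allowed to look at. In goquery itself, rowSelection.Children() or rowSelection.ChildrenFiltered("td") should be a more direct way to get only the direct td children of the outer rows, though that is a suggestion and not something confirmed in this thread.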
Yeah, if you don't mind, I'll keep it open as a reminder to address this in goquery's documentation. I haven't had time to get to it yet, but should be able to do so shortly.
All right, I added some notes about this in v1.9.1. Closing now, thanks!