Xpath with | character does not return element with duplicated text

Question

Xpath with | character does not return element with duplicated text

WithoutPants opened this issue 5 years ago · 5 comments

Using a modified version of the html sample in query_test.go:

const htmlQuerySample = `<!DOCTYPE html><html lang="en-US">
<head>
<title>Hello,World!</title>
</head>
<body>
<div class="container">
<header>
	<!-- Logo -->
   <h1>City Gallery</h1>
</header>  
<nav>
  <ul>
    <li><a href="/London">London</a></li>
    <li><a href="/Paris">Paris</a></li>
    <li><a href="/Tokyo">Tokyo</a></li>
    <li><a href="/Tokyo2">Tokyo</a></li>
  </ul>
</nav>
<article>
  <h1>London</h1>
  <img src="pic_mountain.jpg" alt="Mountain View" style="width:304px;height:228px;">
  <p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
  <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</article>
<footer>Copyright &copy; W3Schools.com</footer>
</div>
</body>
</html>
`

(added extra Tokyo li element)

Using xpath "//nav/ul/li/a/text()|//div/ul/li/a/text()" (the second condition after the | is just an example and not meant to find anything additional), I would expect four elements to be returned - London, Paris, Tokyo, Tokyo. Instead, it returns three, dropping the second Tokyo element. Removing the | character and second condition returns four elements. Similarly, if I remove the text() part, then it also returns four elements.

Small unit test illustrating the issue:

func TestXpathQuery(t *testing.T) {
	reader := strings.NewReader(htmlQuerySample)
	doc, err := htmlquery.Parse(reader)

	if err != nil {
		t.Errorf("Error loading document: %s", err.Error())
		return
	}

	doTestXpath(t, doc, "//nav/ul/li/a/text()")
	doTestXpath(t, doc, "//nav/ul/li/a|//div/ul/li/a")
	doTestXpath(t, doc, "//nav/ul/li/a/text()|//div/ul/li/a/text()")
}

func doTestXpath(t *testing.T, doc *html.Node, xpath string) {
	found, err := htmlquery.QueryAll(doc, xpath)

	if err != nil {
		t.Error(err.Error())
	}

	if len(found) != 4 {
		t.Errorf("Xpath: %s - expected 4 elements found %d", xpath, len(found))
	}
}

Gives the following results:

--- FAIL: TestXpathQuery (0.00s)
    ...\xpath_test.go:978: Xpath: //nav/ul/li/a/text()|//div/ul/li/a/text() - expected 4 elements found 3
FAIL

Answer 1 · 2020-04-13T02:16:00.000Z

htmlquery package will auto remove duplicated text since #20 .

Answer 2 · 2020-04-16T01:29:26.000Z

htmlquery package will auto remove duplicated text since #20 .

Except that it doesn't consistently.

Take the source string 242424

Perform a QueryAll using //sup/text().

Expected output would be:

24
24
24

Actual output is the same - it works as intended.

Now perform QueryAll using //sup/text()|/div/text(). The only difference here is the extra option. The expected output should be the same.

Actual output for this one is:

So at the very least there is inconsistency between the behaviour of the two queries, and my expectation would be that it is the second case that is performing incorrectly.

Answer 3 · 2020-04-16T10:13:24.000Z

This is a bug, I had fixed at antchfx/xpath@a015dcb
Thanks for your feedback.

Answer 4 · 2020-04-16T22:41:38.000Z

Thank you. I can confirm that this has fixed the issue.

Answer 5 · 2020-04-22T06:05:27.000Z

When are we likely to see a new release with this fix?