Xpath with | character does not return element with duplicated text
WithoutPants opened this issue · 5 comments
Using a modified version of the html sample in query_test.go
:
const htmlQuerySample = `<!DOCTYPE html><html lang="en-US">
<head>
<title>Hello,World!</title>
</head>
<body>
<div class="container">
<header>
<!-- Logo -->
<h1>City Gallery</h1>
</header>
<nav>
<ul>
<li><a href="/London">London</a></li>
<li><a href="/Paris">Paris</a></li>
<li><a href="/Tokyo">Tokyo</a></li>
<li><a href="/Tokyo2">Tokyo</a></li>
</ul>
</nav>
<article>
<h1>London</h1>
<img src="pic_mountain.jpg" alt="Mountain View" style="width:304px;height:228px;">
<p>London is the capital city of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</article>
<footer>Copyright © W3Schools.com</footer>
</div>
</body>
</html>
`
(added extra Tokyo
li
element)
Using xpath "//nav/ul/li/a/text()|//div/ul/li/a/text()"
(the second condition after the |
is just an example and not meant to find anything additional), I would expect four elements to be returned - London
, Paris
, Tokyo
, Tokyo
. Instead, it returns three, dropping the second Tokyo
element. Removing the |
character and second condition returns four elements. Similarly, if I remove the text()
part, then it also returns four elements.
Small unit test illustrating the issue:
func TestXpathQuery(t *testing.T) {
reader := strings.NewReader(htmlQuerySample)
doc, err := htmlquery.Parse(reader)
if err != nil {
t.Errorf("Error loading document: %s", err.Error())
return
}
doTestXpath(t, doc, "//nav/ul/li/a/text()")
doTestXpath(t, doc, "//nav/ul/li/a|//div/ul/li/a")
doTestXpath(t, doc, "//nav/ul/li/a/text()|//div/ul/li/a/text()")
}
func doTestXpath(t *testing.T, doc *html.Node, xpath string) {
found, err := htmlquery.QueryAll(doc, xpath)
if err != nil {
t.Error(err.Error())
}
if len(found) != 4 {
t.Errorf("Xpath: %s - expected 4 elements found %d", xpath, len(found))
}
}
Gives the following results:
--- FAIL: TestXpathQuery (0.00s)
...\xpath_test.go:978: Xpath: //nav/ul/li/a/text()|//div/ul/li/a/text() - expected 4 elements found 3
FAIL
htmlquery
package will auto remove duplicated text since #20 .
Except that it doesn't consistently.
Take the source string <sup>24</sup><sup>24</sup><sup>24</sup>
Perform a QueryAll
using //sup/text()
.
Expected output would be:
24
24
24
Actual output is the same - it works as intended.
Now perform QueryAll
using //sup/text()|/div/text()
. The only difference here is the extra option. The expected output should be the same.
Actual output for this one is:
24
So at the very least there is inconsistency between the behaviour of the two queries, and my expectation would be that it is the second case that is performing incorrectly.
This is a bug, I had fixed at antchfx/xpath@a015dcb
Thanks for your feedback.
Thank you. I can confirm that this has fixed the issue.
When are we likely to see a new release with this fix?