Depth-First Search not really Depth-First ?
Closed this issue · 3 comments
Thanks for this great, easy-to-use, smartly thought out, library.
According to comments, the selectNodes function is a DFS implementation. I'm not completely sure (manual recursion is not my forte), but I think it's not.
Example case:
$ stack ghci
> :set -XOverloadedString
> let xml = "<html><div><p>p1</p><p>p2</p><blockquote><p>p3</p></blockquote><p>p4</p></div></html>"
> scrapeStringLike xml (texts $ "div" // "p")
Just ["p1","p2","p4","p3"]
But this XML:
<html>
<div>
<p>p1</p>
<p>p2</p>
<blockquote><p>p3</p></blockquote>
<p>p4</p>
</div>
</html>
Should, in DFS, be processed through in the order given by the letter in parenthesis in the example below:
Span 0 17 (A)
|
`- Span 1 16 (B)
|
+- Span 2 4 (C)
|
+- Span 5 7 (D)
|
+- Span 8 12 (E)
| |
| `- Span 9 11 (F)
|
`- Span 13 15 (G)
So the selector should yield Just ["p1","p2","p3","p4"]
.
Unless the previous result was the intended behaviour ? In which case I think comments and README should warn about this.
Suggested fix
In this pattern (https://github.com/fimad/scalpel/blob/master/scalpel-core/src/Text/HTML/Scalpel/Internal/Select.hs#L222), we should not have:
| otherwise = selectNodes [n] (tags, fs, ctx) $ selectNodes [n] (tags, Tree.subForest f, ctx) acc
Or we won't get depth-first; we should rather have:
| otherwise = selectNodes [n] (tags, Tree.subForest f, ctx) $ selectNodes [n] (tags, fs, ctx) acc
This pass tests and gives the proper result with the former ghci example. I have a PR ready to fire if needed.
Yep it is intended to be DFS, I'd be happy to take a PR that updates the behavior accordingly.
There you go : #60
I also added a simple test case, so if you have a better solution than mine, you can just pick the test to check yours against the bug.
Just pushed v0.5.1 with this change. Thanks again for the fix!