fimad/scalpel

Depth-First Search not really Depth-First ?

Closed this issue · 3 comments

Thanks for this great, easy-to-use, smartly thought out, library.

According to comments, the selectNodes function is a DFS implementation. I'm not completely sure (manual recursion is not my forte), but I think it's not.

Example case:

$ stack ghci
> :set -XOverloadedString
> let xml = "<html><div><p>p1</p><p>p2</p><blockquote><p>p3</p></blockquote><p>p4</p></div></html>"
> scrapeStringLike xml (texts $ "div" // "p")
Just ["p1","p2","p4","p3"]

But this XML:

<html>
  <div>
    <p>p1</p>
    <p>p2</p>
    <blockquote><p>p3</p></blockquote>
    <p>p4</p>
  </div>
</html>

Should, in DFS, be processed through in the order given by the letter in parenthesis in the example below:

Span 0 17 (A)
|
`- Span 1 16 (B)
   |
   +- Span 2 4 (C)
   |
   +- Span 5 7 (D)
   |
   +- Span 8 12 (E)
   |  |
   |  `- Span 9 11 (F)
   |
   `- Span 13 15 (G)

So the selector should yield Just ["p1","p2","p3","p4"].

Unless the previous result was the intended behaviour ? In which case I think comments and README should warn about this.

Suggested fix

In this pattern (https://github.com/fimad/scalpel/blob/master/scalpel-core/src/Text/HTML/Scalpel/Internal/Select.hs#L222), we should not have:

| otherwise = selectNodes [n] (tags, fs, ctx) $ selectNodes [n] (tags, Tree.subForest f, ctx) acc

Or we won't get depth-first; we should rather have:

| otherwise  = selectNodes [n] (tags, Tree.subForest f, ctx) $ selectNodes [n] (tags, fs, ctx) acc

This pass tests and gives the proper result with the former ghci example. I have a PR ready to fire if needed.

fimad commented

Yep it is intended to be DFS, I'd be happy to take a PR that updates the behavior accordingly.

There you go : #60

I also added a simple test case, so if you have a better solution than mine, you can just pick the test to check yours against the bug.

fimad commented

Just pushed v0.5.1 with this change. Thanks again for the fix!