Allow direct manipulation of TagSpec object
Sometimes I would rather work with the node tree (and thus TagSpec) itself than with the Scraper / SerialScraper interface.
It would be optimal for my use case if TagSpec and various functions for manipulating it (children, name, etc.) were exposed as a low-level API. The current high-level API would then be a layer on top of that and would stay the same as it is now, except perhaps for some extra functions for dropping into the low-level API when desired.
Of course, TagSpec itself would have to be an abstract data type with a hidden constructor and fields rather than a tuple, so that its invariants cannot be violated. It would also probably be worth renaming the type to something like Html or Nodes. Another thing to consider is whether it's worth having explicit types for when you know you have a single node versus potentially zero or multiple nodes (Tree/Node vs Forest/Nodes/Html), so that functions like name :: Node str -> str make more sense.
This has come up a few times. Historically, I've pushed back on exposing this type directly since I consider it an implementation detail, and I've already changed the type signature dramatically several times to get things to run faster.
However, I would be OK reworking things so that we are able to expose some more low level APIs while keeping the real internals hidden and out of the public API.
Do you have a proposal for what such an API would look like or an idea of what sort of operations you are looking for?
Essentially I would like 3 different types:
A list/vector-of-Node type that represents a chunk of HTML. The low-level scrapeStringLike would output this type, and things like the children of an Element would have this type.
A Node type that represents either an Element or a leaf node like Text or a Comment, or even an unmatched opening/closing tag.
An Element type that represents an actual DOM element with a name and a list/map of attributes, as well as a list/vector of children.
I would really like to be able to pattern match on and inspect/print these types; it helps a lot with debugging and intuitiveness.
The exact details of this API and what's underneath it are not important, but anything that's easy, intuitive, and debuggable with the above API should ideally not have to change significantly to work.
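To make the three types above concrete, here is a minimal sketch of how they could look. Every name in it (Nodes, Node, Element, the constructors, and childElementNames) is hypothetical and does not exist in scalpel; it only illustrates the shape being asked for, with derived Show instances so values can be printed while debugging.

```haskell
-- Hypothetical types for illustration; none of these names exist in scalpel.

-- | A chunk of HTML: zero or more sibling nodes.
type Nodes str = [Node str]

-- | A single node: an element, a leaf, or a stray (unmatched) tag.
data Node str
  = NodeElement (Element str)
  | NodeText str
  | NodeComment str
  | NodeUnmatchedOpen str [(str, str)]
  | NodeUnmatchedClose str
  deriving (Eq, Show)

-- | An element with a name, attributes, and children.
data Element str = Element
  { elementName     :: str
  , elementAttrs    :: [(str, str)]
  , elementChildren :: Nodes str
  } deriving (Eq, Show)

-- | Names of the elements that sit directly inside a chunk of HTML.
-- Leaves and unmatched tags are skipped by the pattern match.
childElementNames :: Nodes str -> [str]
childElementNames nodes = [elementName e | NodeElement e <- nodes]

-- | A small hand-built value: one element, one comment, one stray close tag.
example :: Nodes String
example =
  [ NodeElement (Element "p" [("class", "intro")] [NodeText "hi"])
  , NodeComment " note "
  , NodeUnmatchedClose "div"
  ]
```

With exposed constructors like these, a function such as name :: Node str -> str becomes an ordinary pattern match, at the cost of tying the public API to one concrete representation.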
Text.HTML.TagSoup.Tree is an example of an interface that would definitely work for this, as it meets all of the above requirements (technically one of the above types is a constructor there, but that works too). We would just also want the various combinators provided by scalpel for searching through these types quickly and easily. We also don't like the "preliminary" note at the top of that module.
I think that sounds doable. One area that I think still needs some consideration is the edge cases around how different types of malformed HTML are handled. We'd also need to ensure that we do not regress on performance; a lot of commits went into getting things running fast with the current data structures.
Another option to consider would be exposing a scraper that returns a list of TagSoup Tags. You could then combine this with Text.HTML.TagSoup.Tree to get a tree structure to work with:
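    -- Proposed additions; tagTree comes from Text.HTML.TagSoup.Tree.
    tags :: StringLike str => Selector -> Scraper str [Tag str]
    tags = foldSpec (\tag s -> tag : s)

    -- tagTree returns a forest, so the result is a list of TagTree values.
    tree :: StringLike str => Selector -> Scraper str [TagTree str]
    tree selector = tagTree <$> tags selector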
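As a rough illustration of what working with the resulting tree could look like, the sketch below uses only the tagsoup package (no scalpel) and assumes it is installed. The helpers elementNames and textLeaves and the value sampleForest are hypothetical names made up for this example; parseTags, tagTree, TagBranch, TagLeaf, and TagText are existing tagsoup exports.

```haskell
import Text.HTML.TagSoup (Tag (..), parseTags)
import Text.HTML.TagSoup.Tree (TagTree (..), tagTree)

-- | Names of every element in the forest, parents before children.
-- (Hypothetical helper, not part of tagsoup or scalpel.)
elementNames :: [TagTree str] -> [str]
elementNames = concatMap go
  where
    go (TagBranch name _ children) = name : elementNames children
    go (TagLeaf _)                 = []

-- | Contents of every text leaf in the forest, in document order.
-- (Hypothetical helper, not part of tagsoup or scalpel.)
textLeaves :: [TagTree str] -> [str]
textLeaves = concatMap go
  where
    go (TagBranch _ _ children) = textLeaves children
    go (TagLeaf (TagText t))    = [t]
    go (TagLeaf _)              = []

-- | A well-formed sample document turned into a forest of TagTree values.
sampleForest :: [TagTree String]
sampleForest = tagTree (parseTags "<div><p>hello</p><p>world</p></div>")
```

Because TagTree exposes its constructors, both helpers are plain pattern matches. For this well-formed input, elementNames sampleForest gives ["div","p","p"] and textLeaves sampleForest gives ["hello","world"]; malformed input is a separate question, as noted above.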
I suspect the first approach would take a nontrivial amount of work and I don't have much bandwidth myself to work on this right now. However, I'd be happy to take a patch if you want to take it on.