JuliaWeb/Gumbo.jl

Implement pretty printing of elements and documents

Closed this issue · 9 comments

I'm trying to read an HTML file and change a specific value based on a Julia script that is running. I figured I would be able to modify the contents of the HTML file with this package, but I'm a little unsure of how to go about it.
Ideally, I'd use the tree traversal to locate the specific element I want to modify. After parsing the file and trying this a few times, though, it seems that when I modify the element the change isn't stored, because what I have is only a reference into the document and not the document itself.
My second option is to rely on the absolute position of the element in the file (which isn't hard right now, as the file is still small), and this seems like it should work.
However, once I get past that step, I'm a little unsure of how to save the modified document back to the original file.

This is a library for HTML parsing; modifying the contents of an HTML file in place sounds like a thorny problem and is definitely out of scope for this project. I would suggest parsing the file, making your changes to the resulting HTML tree in memory, then writing that tree back out to the file, replacing the old contents (which you can do with the print(io, htmldocument) method).
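A rough sketch of that parse, modify, write-back workflow, for illustration only. It assumes a recent Julia and a Gumbo version where HTMLElement exposes a children field and HTMLText exposes a text field; find_first is a hypothetical helper written for this example (not part of Gumbo), and the output path is a placeholder.

```julia
using Gumbo

# Hypothetical helper (not part of Gumbo): return the first element with the
# given tag in a depth-first walk, or `nothing` if there is none.
function find_first(node::HTMLElement, name::Symbol)
    tag(node) == name && return node
    for child in node.children
        if child isa HTMLElement
            found = find_first(child, name)
            found === nothing || return found
        end
    end
    return nothing
end

doc = parsehtml("<h1>Hello</h1><div><p>hi</p></div>")

# Locate the element by walking the tree rather than by file position.
h1 = find_first(doc.root, :h1)

# Assumes the <h1> holds a single text node as its first child.
h1.children[1].text = "Goodbye"

# Write the whole tree back out, replacing any previous file contents.
path = joinpath(mktempdir(), "page.html")   # placeholder path
open(path, "w") do io
    print(io, doc)
end
```

Because the nodes returned by the walk are the same objects held by the document, the change made through h1 shows up when the document is printed.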

I think that is what I meant for my overall goal. I can get that to work with print(io, htmldoc.root), but currently I'm getting a stack overflow when trying to use print(io, htmldocument).

As for my ideal scenario of traversing the tree and modifying the document based on the element rather than its absolute position: I added a pop! method to remove the element, so it can be replaced with a push! of what I want.
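The remove-then-replace idea can be sketched without any extra methods by working on the parent's children vector directly. This is only an illustration under assumptions: it relies on HTMLElement having children and parent fields, and it borrows the replacement node from a second parse rather than using the pop! method mentioned above, whose exact form is not shown in this thread.

```julia
using Gumbo

doc = parsehtml("<div><p>old</p></div>")
body = doc.root.children[2]        # the root has two children: <head> and <body>
container = body.children[1]       # the <div>

# Borrow a replacement <p> node from a separately parsed fragment.
replacement = parsehtml("<p>new</p>").root.children[2].children[1]

# Plain Vector operations on the children list: drop the old node, add the new one.
removed = pop!(container.children)
replacement.parent = container     # keep the parent link consistent
push!(container.children, replacement)

result = string(container)
```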

I think I found the cause of that stack overflow; it should be fixed on master if you Pkg.checkout again.

The stack overflow is fixed. My only other related question: is there any way to print to io with newlines, tabs, etc.?

Not yet, no. show already implements this kind of crappily, but also writes some Julia-specific information at the top. The pretty printing should probably be extracted into a prettyprint function which show can then call.

I see what you mean with the julia info with show. I just started looking at PrettyPrint.jl but am a little confused as to how this could be utilized.

@kbuzby OK, I found some time to implement nicer printing (on the prettyprint branch). You should be able to try it out with Pkg.checkout("Gumbo", "prettyprint"). For example:

julia> Pkg.checkout("Gumbo", "prettyprint")
INFO: Checking out Gumbo prettyprint...
INFO: Pulling Gumbo latest prettyprint...
INFO: No packages to install, update or remove

julia> using Gumbo

julia> res = Gumbo.parsehtml("<h1>Hello</h1><div><p>hi</p><p>whee!</p></div>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:
<HTML>
  <head></head>
  <body>
    <h1>Hello</h1>
    <div>
      <p>hi</p>
      <p>whee!</p>
    </div>
  </body>
</HTML>


julia> prettyprint(STDOUT, res)
<!DOCTYPE ><HTML>
  <head></head>
  <body>
    <h1>Hello</h1>
    <div>
      <p>hi</p>
      <p>whee!</p>
    </div>
  </body>
</HTML>

prettyprint(io, element) and prettyprint(io, document) are the methods you'll be interested in. The print methods also now take a pretty keyword argument that has the same effect.
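A small sketch of the two entry points described above, writing to an in-memory buffer instead of standard output. It assumes a Gumbo release that includes prettyprint and the pretty keyword; the variable names are made up for the example.

```julia
using Gumbo

doc = parsehtml("<h1>Hello</h1><div><p>hi</p><p>whee!</p></div>")

# Indented output through the dedicated function.
buf = IOBuffer()
prettyprint(buf, doc)
pretty_a = String(take!(buf))

# The same through the `pretty` keyword on print, which is described above
# as having the same effect.
print(buf, doc; pretty=true)
pretty_b = String(take!(buf))

# Default print, without the added newlines and indentation.
print(buf, doc)
compact = String(take!(buf))
```

Swapping the buffer for a file handle opened with open(path, "w") should give the same text on disk.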

Mind trying it out and letting me know if it works for you?

Just tried it out and works just as I'd hoped. It's great! Thanks!

@kbuzby I cut a release last night so you should be able to just do Pkg.update() now instead of having to use master :)