JuliaWeb/Gumbo.jl

Tagged whitespace disappears

Closed this issue · 7 comments

See also Discourse thread. I strongly suspect this is based on the original Gumbo library's behavior (not having tested that), or maybe even is specified as part of the HTML5 parsing algorithm. In the latter case, I guess I'll just have to deal with it; in the former, if it's a bug, perhaps Gumbo.jl could still work around it somehow?

Anyway: The issue is that whitespace that is wrapped in tags disappears, contrary to how things are rendered in a browser, for example. In the following, I'm just using nodeText from Cascadia to extract the text; that may not be the best way to do it (and might even be related to the issue, though the whitespace does seem gone in the parsed HTML, too):

julia> using Gumbo, Cascadia

julia> x = parsehtml("<em>foo</em> bar<em> </em>baz")
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    <em>
      foo
    </em>
    bar
    <em></em>
    baz
  </body>
</HTML>

julia> nodeText(x.root)
"foo barbaz"

Here I would have wished for "foo bar baz", which is what a browser would display. The whitespace is not stripped if there's some non-whitespace in there:

julia> nodeText(parsehtml("foo<em> bar </em>baz").root)
"foo bar baz"

(Of course, using em on whitespace doesn't make much sense; I've just come across it in the wild, and am losing spaces when scraping certain pages, needing to figure out a workaround that isn't too hacky.)

Hi, thanks for reporting!

I've figured out what causes this: it is something on Gumbo.jl's end, not the underlying gumbo C library.

The C library does indeed parse and create nodes for all whitespace. However, Gumbo.jl filters them out, since for a lot of typical Julian use-cases (extracting structured data from webpages, etc.) they're just noise. For instance, in the following example:

<html>
  <head>
    <meta description="test page"></meta>
  </head>
  <body>
    <p>A simple test page.</p>
  </body>
</html>

If Gumbo.jl didn't filter whitespace nodes, it would give the body element three children:

  1. an HTMLText node for the newline and indent before the p
  2. the p itself
  3. another HTMLText node for the newline and indent after the p

When I was originally writing the library, this all seemed like a bit much that most people wouldn't care about, so I decided to "filter out" the whitespace nodes (implicitly, by not including a branch for CGumbo.WHITESPACE here) and just leave the p.

I'd like to not change this as the default behavior, since I think this is still convenient and the right choice for the majority of use-cases.

How important is whitespace preservation for your use-case? I guess what I'm asking is: was this issue opened out of curiosity about why this doesn't work, or because you have a real need for whitespace preservation in an application? If the latter, I can implement an optional keyword argument (preserve_whitespace=true or something like that) to parsehtml that would leave all the whitespace nodes intact. It would just be a bit of a pain to thread that option through all the parsing code, so I wanted to make sure you really need it before I go to the trouble of writing all the code :)

Ah, I see! Well, it is an issue for a specific application of some screen scraping I’m doing at the moment – but it’s more of a one-off glitch because of weird markup in the source web pages (over which I have no direct control). But that’s not really enough to warrant adding special-casing/warts to Gumbo.jl :-) So I’d be okay with closing the issue, given this explanation.

This is an interesting special case, though; in many other cases, white space doesn’t disappear – it’s just normalized. Like between the words in <em>foo</em> <em>bar</em> (unless I’m mistaken). So, for example, when I extract the text from a piece of HTML, in general it works just fine, with no words jammed together – except in the case I describe (or, at least, that’s all I’ve observed). Not sure what conclusions to draw from that, though.

Anyway: I’m sure I’ll find ways around this for my current purposes.

@mlhetland OK, thanks! I'm going to close this issue in that case. If other people open issues with similar needs, then I'll implement the option to preserve whitespace nodes.

The reason whitespace between words in a text node doesn't disappear is that the gumbo C library (and maybe the HTML spec?) draws a distinction between text nodes that are only whitespace and text nodes that have any other characters in them. Gumbo.jl implicitly removes the former (for the convenience reasons I mentioned above).
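A toy model may make the distinction concrete. This is only a sketch, not Gumbo.jl's actual conversion code: plain strings stand in for the text nodes the C library produces, and `is_whitespace_only` and `extract_text` are hypothetical helper names.

```julia
# Toy model only: each string stands in for one text node from the parser.
# A "whitespace node" is one that contains nothing but whitespace.
is_whitespace_only(s::AbstractString) = all(isspace, s)

# Drop whitespace-only nodes, then concatenate what is left.
extract_text(nodes) = join(filter(!is_whitespace_only, nodes))

# Text nodes of "<em>foo</em> bar<em> </em>baz": the " " inside the second
# <em> is whitespace-only and gets dropped, while " bar" survives intact.
extract_text(["foo", " bar", " ", "baz"])   # "foo barbaz"

# Text nodes of "foo<em> bar </em>baz": no whitespace-only node, nothing lost.
extract_text(["foo", " bar ", "baz"])       # "foo bar baz"
```

Whitespace that shares a text node with other characters is kept; only nodes consisting entirely of whitespace vanish.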

Makes sense!

Or, hm. No, that’s what I assumed. But why isn’t the node with whitespace between the two element nodes in my example removed? I assume that’s not a separate whitespace node, but stored differently?

I'm confused what you mean by "the node with whitespace between the two element nodes in my example". Is the example you're talking about <em>foo</em> <em>bar</em>? The whitespace is removed for me there too:

julia> using Gumbo, Cascadia

julia> nodeText(parsehtml("<em>foo</em> <em>bar</em>").root)
"foobar"

Indeed – my mistake. I guess in my application I just didn't have any adjacent marked-up pieces of text, so this wasn't an issue. It does make this less of a marginal issue, though, as this kind of thing normally does occur in marked-up text, and dropping whitespace nodes will then garble the text. Even if you don't want to extract the plain text directly like this, as long as you're treating the document as marked-up text (rather than extracting values or something) – e.g., converting to Markdown or LaTeX or rewriting the HTML (as with XSLT) or whatever – the results will be wrong. In this case, for example, rewriting <em>foo</em> <em>bar</em> to LaTeX would yield \emph{foo}\emph{bar}.
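To illustrate the LaTeX case, here is a minimal sketch with a made-up representation: each piece is a `(tag, text)` tuple, with `nothing` as the tag for plain text. The helpers `to_latex`, `render`, and `drop_whitespace` are hypothetical and have nothing to do with Gumbo.jl's API.

```julia
# Toy representation of "<em>foo</em> <em>bar</em>": (tag, text) pairs,
# where a tag of `nothing` marks plain text outside any element.
pieces = [(:em, "foo"), (nothing, " "), (:em, "bar")]

# Render one piece: emphasized text becomes \emph{...}, plain text is kept.
to_latex((tag, text)) = tag === :em ? "\\emph{" * text * "}" : text

# Render a whole sequence of pieces.
render(pieces) = join(map(to_latex, pieces))

# Mimic the filtering: discard pieces whose text is whitespace only.
drop_whitespace(pieces) = filter(p -> !all(isspace, p[2]), pieces)

println(render(pieces))                   # \emph{foo} \emph{bar}
println(render(drop_whitespace(pieces)))  # \emph{foo}\emph{bar}
```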

Still, whether this is a problem (and, if so, whether it's a problem important enough to address) depends on the intended use-cases for Gumbo.jl, of course. (And, I guess, the amount of work involved in adding a whitespace preservation switch, or the like :-))

It's not crucial to me at the moment. And anyway, a possible workaround could be something like replacing r"(\s+)" with s"⟦\1⟧" before parsing, and then replacing r"⟦(\s+)⟧" with s"\1" in the final text. (Here ⟦ and ⟧ are arbitrary markers, of course.) That should make any whitespace be treated like any other text. (One might want to restrict which whitespace is replaced as well, I guess, to only that between > and <, as in the suggestion by @Nosferican.)
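A minimal sketch of that workaround, restricted to whitespace that sits directly between > and <. The helper names `protect_whitespace` and `restore_whitespace` are made up for illustration, and the sketch assumes the markers never occur in the real page text. Only Base string functions are used, so it runs without Gumbo.jl; the parsing step is shown as a comment.

```julia
# Hypothetical helpers; ⟦ and ⟧ are arbitrary markers assumed to be absent
# from the document being scraped.

# Wrap whitespace-only runs between a '>' and a '<' in markers, so that the
# resulting text node is no longer whitespace-only.
protect_whitespace(html::AbstractString) =
    replace(html, r"(?<=>)(\s+)(?=<)" => s"⟦\1⟧")

# Remove the markers from the extracted text, keeping the whitespace.
restore_whitespace(text::AbstractString) =
    replace(text, r"⟦(\s+)⟧" => s"\1")

protected = protect_whitespace("<em>foo</em> <em>bar</em>")
# protected == "<em>foo</em>⟦ ⟧<em>bar</em>"

# With Gumbo and Cascadia loaded, the round trip would look roughly like:
#   restore_whitespace(nodeText(parsehtml(protected).root))

restore_whitespace("foo⟦ ⟧bar")   # "foo bar"
```

Note that this also marks the newlines and indentation between block-level tags, so the extracted text may contain more whitespace than a browser would render; collapsing runs of whitespace afterwards is one way to deal with that.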