google/gumbo-parser

Handling of 'noscript' in 'head' may be wrong

hjcapple opened this issue · 9 comments

I ran examples/prettyprint on the HTML from this link:

http://www.typeisbeautiful.com/2014/08/8243/

When I open the output HTML in a browser, there is some noisy text at the top of the page.

Checking the output HTML, it looks like the example mishandles 'noscript' inside 'head'.

I tried to fix it.

In the function handle_in_head_noscript, I changed

if (token->type == GUMBO_TOKEN_WHITESPACE ||
    token->type == GUMBO_TOKEN_COMMENT ||

to

if (token->type == GUMBO_TOKEN_WHITESPACE ||
    token->type == GUMBO_TOKEN_COMMENT ||
    token->type == GUMBO_TOKEN_CHARACTER ||

And in the function handle_in_head, I changed

if (token->type == GUMBO_TOKEN_WHITESPACE) 

to

if (token->type == GUMBO_TOKEN_WHITESPACE || token->type == GUMBO_TOKEN_CHARACTER)

After this change, the output looks correct, but I don't know whether the fix itself is right. Sorry for my poor English.

@hjcapple fyi - the prettyprint and serialize code here are just examples of how these things might be done; they are not meant to be definitive. So it is much more likely that prettyprint has a bug here than the underlying gumbo-parser itself.

@hjcapple - your link has both comments and CDATA in it, and neither the current serialize nor the current prettyprint code handles those cases. The link also has an XHTML doctype, which is quite strange. Both serialize and prettyprint would have to be modified to support those gumbo node classes. If the link's XHTML doctype is correct and the page uses any non-void self-closing tags, that will also be an issue, since gumbo is an HTML parser, not an XHTML parser.

FWIW, I will open a pull request to update both serialize and prettyprint to the later/better versions I have in my own tree.

If that link's doctype of xhtml is correct ...

It's not being served with an XHTML content type (which is how UAs are supposed to determine which parser to use):

$ curl -s -D - http://www.typeisbeautiful.com/2014/08/8243/ | grep 'Content-Type'
Content-Type: text/html; charset=UTF-8
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

That is good to know. I would still bet that the problem is inside prettyprint and not in gumbo, but I cannot reproduce any issue with noscript tags. I see them in that site's HTML, but as far as I can tell they are untouched on output. If I load the output of my current prettyprint into a web browser, I observe no changes. So I am guessing it is the comment or CDATA nodes not being handled in the official serialize or prettyprint. I will build official gumbo and see what it shows.

I checked with prettyprint from gumbo master and could see no issues. The CDATA usages were inside script elements, so they were not touched. I then loaded the output into Firefox and could see no issues.

I checked against the spec; it's correct for Gumbo to only handle whitespace and selected start tags as if they were in_head, and the handle_in_head clause should be only whitespace.

Could you post the output you're getting from prettyprint? I'm curious exactly what noisy text you're seeing, and where it may be coming from.

I may have done something wrong.

I'm writing an iPhone Safari extension. I use Safari to open the link, and then get the page's HTML with the JavaScript expression 'document.documentElement.outerHTML'.

I found that the result differs slightly from the raw HTML fetched from the link.

The raw HTML is

<noscript><link rel="stylesheet" type="text/css"

but the output HTML obtained from JavaScript is

<noscript>&lt;link rel="stylesheet" type="text/css"

I then use gumbo to parse the HTML obtained from JavaScript, do some processing, and convert it back to HTML again - and I see the noisy text.

[screenshot: noisy_text]

Testing the HTML obtained from JavaScript with prettyprint:

[screenshot: noisy_text_2]

The zip includes three files.

raw_html_from_link.html
the raw HTML fetched from the link using curl.

ouput_html_from_js.html
the raw HTML as loaded by iPhone Safari and output via JavaScript.

ouput_html_prettyprint.html
the result of running prettyprint on ouput_html_from_js.html.

html.zip

I think the behavior you're seeing is because of the iPhone Safari Extension. When I run prettyprint on raw_html_from_link.html, there's no extra text, and everything looks identical to the page.

The iPhone Safari extension browses with JavaScript enabled. When JS is enabled in the user agent, the spec says to treat everything inside a noscript tag as raw text, i.e. the <link rel="stylesheet" type="text/css" is literally that text in the parse tree, and doesn't parse as an element. When you output that from JS, it's already escaped as &lt;link rel=... - you can see that in output_html_from_js.html. However, since browsers with JS enabled ignore the contents of noscript tags, it isn't rendered by the browser.

When you run it through prettyprint, it's just literal text. Literal text inside a noscript tag in the head gets handled as if it were directly in the head element - but it's not legal there, so the head element is closed and the text is moved to the start of the body element, where it shows up in the rendered document.

Thank you. I think that's where the problem is.
I changed my approach: when getting the HTML from the Safari extension, I now skip all 'noscript' tags.