lexborisov/Modest

Handling of malformed iframe tags

Opened this issue · 5 comments

I've noticed a pretty annoying problem on some websites (I think there are at least a thousand of them in Alexa 1M).

An unclosed iframe tag breaks all the HTML below it.

Here is an example:

<noscript>
    <iframe
            height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW" class="lazyload"
            src="">
        <noscript>
            <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW"
                    height="0" width="0">
        </noscript>
    </iframe>
</noscript>

The inner iframe is missing its closing tag, but Modest still parses this correctly.

But for some reason, if you open the page in Chrome (to render the JavaScript parts) and dump the HTML, you get this:

<noscript>
<iframe 
      height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=" class="lazyload"
       src="">
       <noscript>
        <iframe src="https://www.googletagmanager.com/ns.html?id=" height="0" width="0">
</noscript>

Now neither iframe has a closing tag.

The problem with this is that Modest will ignore everything after such a tag:

<noscript>
<iframe data-src="https://www.googletagmanager.com/ns.html?id=">
</noscript>


<script></script>
<script></script>
<script></script>

Searching for script nodes using myhtml_get_nodes_by_name or using CSS selectors returns no results.

@lexborisov Are there any ways to improve this? Other parsers can still handle this.

@rushter try https://github.com/lexbor/lexbor

I maintain a Python binding for Modest; lexbor is not yet ready to replace it.

Any updates on this?

Hi @rushter @omerh2802

Maybe we should tell the parser that scripting is enabled?
Then everything inside the noscript tag will be treated as text. It won't affect anything else.

Set MyHTML_TREE_FLAGS_SCRIPT on the tree before parsing:

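    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    /* Mark scripting as enabled so noscript content is parsed as text. */
    tree->flags |= MyHTML_TREE_FLAGS_SCRIPT;

    /* html and length are the input buffer and its size in bytes. */
    myhtml_parse(tree, MyENCODING_UTF_8, html, length);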

Hi @lexborisov, that sounds like a great idea!