lexborisov/myhtml

Script Tags Cause Incorrect Chunked Parsing With Closing Body and HTML Tags

schrodingersket opened this issue · 2 comments

I've noticed that when a document contains a script tag together with closing body or html tags, chunked parsing produces a malformed document. Without the script tag, or with the closing body and html tags omitted, parsing works as expected. A slightly modified version of the HTML chunks in chunks_high_level.c illustrates the issue:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <myhtml/api.h>

mystatus_t serialization_callback(const char* data, size_t len, void* ctx)
{
    printf("%.*s", (int)len, data);
    return MyCORE_STATUS_OK;
}

int main(int argc, const char * argv[])
{
    char html[][64] = {
            "<!DOCT",
            "YPE htm",
            "l>",
            "<html><head>",
            "<script>console.log('Hello, world!');</script>",
            "<ti",
            "tle>HTML chun",
            "ks parsing</",
            "title>",
            "</head><bod",
            "y><div cla",
            "ss=",
            "\"bestof",
            "class",
            "\">",
            "good for me",
            "</div>",
            "</body>",
            "</html>",
            "\0"
    };
    
    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
    
    // init tree
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);
    
    myhtml_encoding_set(tree, MyENCODING_UTF_8);
    
    for(size_t i = 0; html[i][0]; i++)
    {
        printf("Parse chunk: %s\n", html[i]);
        
        // parse html
        myhtml_parse_chunk(tree, html[i], strlen(html[i]));
    }
    
    // call to the end
    myhtml_parse_chunk_end(tree);
    
    // print fragment
    myhtml_serialization_tree_callback(myhtml_tree_get_document(tree), serialization_callback, NULL);
    
    // release resources
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);
    
    return 0;
}

This outputs the following:

Parse chunk: <!DOCT
Parse chunk: YPE htm
Parse chunk: l>
Parse chunk: <html><head>
Parse chunk: <script>console.log('Hello, world!');</script>
Parse chunk: <ti
Parse chunk: tle>HTML chun
Parse chunk: ks parsing</
Parse chunk: title>
Parse chunk: </head><bod
Parse chunk: y><div cla
Parse chunk: ss=
Parse chunk: "bestof
Parse chunk: class
Parse chunk: ">
Parse chunk: good for me
Parse chunk: </div>
Parse chunk: </body>
Parse chunk: </html>
<!DOCTYPE html><html><head><script>console.log('Hello, world!');</script><title>HTML chunks parsing</title></head><body><div class="bestofclass">good for me</div></body></html></script></head><body></body></html>
Process finished with exit code 0

You'll notice the extraneous </script></head><body></body></html> string at the end, as though the initial script tag was never closed.
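For anyone who wants to catch this symptom in an automated test, here is a minimal sketch of a check on the serialized output. It does not use myhtml at all; the function name has_trailing_markup is hypothetical, and it assumes the serialized document has been collected into a NUL-terminated buffer rather than printed from the callback.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of myhtml): returns true if any
 * non-whitespace text follows the first "</html>" in a serialized
 * document, which is the symptom described above.
 */
bool has_trailing_markup(const char *serialized)
{
    static const char closing[] = "</html>";

    /* Locate the first closing html tag. */
    const char *end = strstr(serialized, closing);
    if (end == NULL)
        return false; /* no closing tag, so nothing can follow it */

    /* Move past the tag itself. */
    end += sizeof(closing) - 1;

    /* Ignore trailing whitespace such as a final newline. */
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r')
        end++;

    /* Anything left over is extraneous output. */
    return *end != '\0';
}
```

Run against the serialized line shown above, this should return true; for a document that ends at its first </html> it returns false. It is a plain substring search, so a script whose text contains the literal </html> would be flagged as well.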

It's also quite possible that I'm misunderstanding how myhtml_parse_chunk is supposed to be used - if so, clarification would be greatly appreciated.

Thanks in advance for your time and attention!

Hi @schrodingersket
Sorry for the delayed reply. The problem has been resolved.

Thank you!

Fantastic - thank you so much!