Script Tags Cause Incorrect Chunked Parsing With Closing Body and HTML Tags
schrodingersket opened this issue · 2 comments
I've noticed that including script tags along with closing body or html tags causes the resulting HTML document to be malformed when parsed in chunks. When the script tags are absent, or when the closing body and html tags are omitted, parsing works as expected. A slight modification of the HTML chunks in chunks_high_level.c illustrates the issue:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <myhtml/api.h>

mystatus_t serialization_callback(const char* data, size_t len, void* ctx)
{
    printf("%.*s", (int)len, data);
    return MyCORE_STATUS_OK;
}

int main(int argc, const char * argv[])
{
    char html[][64] = {
        "<!DOCT",
        "YPE htm",
        "l>",
        "<html><head>",
        "<script>console.log('Hello, world!');</script>",
        "<ti",
        "tle>HTML chun",
        "ks parsing</",
        "title>",
        "</head><bod",
        "y><div cla",
        "ss=",
        "\"bestof",
        "class",
        "\">",
        "good for me",
        "</div>",
        "</body>",
        "</html>",
        "\0"
    };

    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);

    // init tree
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);
    myhtml_encoding_set(tree, MyENCODING_UTF_8);

    for(size_t i = 0; html[i][0]; i++)
    {
        printf("Parse chunk: %s\n", html[i]);

        // parse html
        myhtml_parse_chunk(tree, html[i], strlen(html[i]));
    }

    // call to the end
    myhtml_parse_chunk_end(tree);

    // print fragment
    myhtml_serialization_tree_callback(myhtml_tree_get_document(tree), serialization_callback, NULL);

    // release resources
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);

    return 0;
}
This outputs the following:
Parse chunk: <!DOCT
Parse chunk: YPE htm
Parse chunk: l>
Parse chunk: <html><head>
Parse chunk: <script>console.log('Hello, world!');</script>
Parse chunk: <ti
Parse chunk: tle>HTML chun
Parse chunk: ks parsing</
Parse chunk: title>
Parse chunk: </head><bod
Parse chunk: y><div cla
Parse chunk: ss=
Parse chunk: "bestof
Parse chunk: class
Parse chunk: ">
Parse chunk: good for me
Parse chunk: </div>
Parse chunk: </body>
Parse chunk: </html>
<!DOCTYPE html><html><head><script>console.log('Hello, world!');</script><title>HTML chunks parsing</title></head><body><div class="bestofclass">good for me</div></body></html></script></head><body></body></html>
Process finished with exit code 0
You'll notice the extraneous </script></head><body></body></html> string at the end, as though the initial script tag was never closed.
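For anyone who wants to detect this symptom automatically rather than by eye, here is a rough sketch of a check on the serialized output. The helper name has_markup_after_html_close is hypothetical and is not part of MyHTML; it assumes the serialized document has already been collected into a single NUL-terminated string (for example by appending inside the serialization callback), and it only looks for non-whitespace content after the first </html>.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of MyHTML.
 * Returns 1 if anything other than whitespace follows the first "</html>"
 * in the serialized document, 0 otherwise. Also returns 0 when the string
 * contains no "</html>" at all. */
static int has_markup_after_html_close(const char *serialized)
{
    static const char close_tag[] = "</html>";
    const char *p = strstr(serialized, close_tag);

    if (p == NULL)
        return 0;

    p += sizeof(close_tag) - 1;   /* skip past the closing tag itself */

    while (*p != '\0') {
        if (!isspace((unsigned char)*p))
            return 1;             /* trailing content such as "</script>..." */
        p++;
    }
    return 0;
}
```

Applied to the output shown above, this check would flag the document, because </script></head><body></body></html> follows the first </html>.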
It's also quite possible that I'm misunderstanding how myhtml_parse_chunk is supposed to be used - if so, clarification would be greatly appreciated.
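One way to rule out a usage mistake would be to parse the same input in a single pass and compare the two serializations. A minimal sketch of the first step is below: it joins the chunk array into one buffer. The helper name join_chunks is hypothetical, and it assumes the same layout as the example above (64-byte rows, terminated by an empty row).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of MyHTML.
 * Concatenates fixed-width chunks (the list ends at the first empty chunk)
 * into one heap-allocated, NUL-terminated string.
 * Returns NULL if allocation fails; the caller must free() the result. */
static char *join_chunks(char chunks[][64])
{
    size_t total = 0;

    /* First pass: measure the combined length. */
    for (size_t i = 0; chunks[i][0]; i++)
        total += strlen(chunks[i]);

    char *joined = malloc(total + 1);
    if (joined == NULL)
        return NULL;

    /* Second pass: copy each chunk into place. */
    char *out = joined;
    for (size_t i = 0; chunks[i][0]; i++) {
        size_t len = strlen(chunks[i]);
        memcpy(out, chunks[i], len);
        out += len;
    }
    *out = '\0';

    return joined;
}
```

The joined buffer could then be handed to a one-shot parse on a fresh tree (I believe myhtml_parse(tree, MyENCODING_UTF_8, joined, strlen(joined)) is the relevant call, but treat that signature as an assumption and check it against api.h). If the single-pass output is well-formed and the chunked output is not, the difference points at chunk handling rather than at the input.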
Thanks in advance for your time and attention!
Hi @schrodingersket
Sorry for the delayed reply. The problem is resolved.
Thank you!
Fantastic - thank you so much!