lexborisov/myhtml

Multiple HTML space parsing

fariouche opened this issue · 4 comments

Hi,

I'm currently evaluating myhtml, and I like it.
I would like to use it to extract text from HTML. However, the HTML specification requires that the spaces are merged, and carriage return and line feed replaced by spaces.

I have a proposed patch for that. It is only a proposal: it most likely needs more work so that the feature can be enabled or disabled.
I know this can be done in my application, but doing it here will be much faster.

--- source/myhtml/mystring.c.orig 2018-01-04 15:30:36.000000000 +0100
+++ source/myhtml/mystring.c 2018-01-04 15:55:41.000000000 +0100
@@ -27,15 +27,18 @@

     unsigned char *data = (unsigned char*)str->data;
     const unsigned char *u_buff = (const unsigned char*)buff;
+    int num_spc = 0;

+    /* 0x0D == \r */
+    /* 0x0A == \n */

     for (size_t i = 0; i < length; i++)
     {
         if(u_buff[i] == 0x0D) {
-            data[str->length] = 0x0A;
+            data[str->length] = ' ';
+            num_spc = 1;
+            str->length++;

             if((i + 1) < length) {
                 if(u_buff[(i + 1)] == 0x0A)
                     i++;
@@ -49,6 +52,13 @@
                 return str->length;
             }
         }
+        else if(u_buff[i] == 0x0A || u_buff[i] == ' ') {
+            if(num_spc == 0) {
+                data[str->length] = ' ';
+                str->length++;
+                num_spc = 1;
+            }
+        }
         else if(u_buff[i] == 0x00 && emit_null_chars == false)
         {
             mycore_string_realloc(str, (str->size + 5));
@@ -57,12 +67,16 @@
             // Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)
             data[str->length] = 0xEF; str->length++;
             data[str->length] = 0xBF; str->length++;
-            data[str->length] = 0xBD;
+            data[str->length] = 0xBD; str->length++;
         }
         else
+        {
             data[str->length] = u_buff[i];
+            num_spc = 0;
+            str->length++;
+        }

-        str->length++;
     }

     str->data[str->length] = '\0';

Hi!

HTML rendering and HTML parsing are different things. Can you show me where in the specification this is written: "HTML specification requires that the spaces are merged, and carriage return and line feed replaced by spaces."? I have looked at the specification, and it seems to me that this is about rendering.

Thanks!

Yes, it is for rendering the HTML, not for parsing it. Sorry, I was not specific enough.
Anyway, the idea here is just to avoid going over all the text a second time, with the extra malloc calls and so on. Since myhtml is performance oriented, I felt that doing it in the parser was the right choice.

Or maybe we can override the function with a callback?
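
To make the callback idea concrete, here is a minimal sketch in plain C. Everything in it is hypothetical: `text_filter_f`, `append_filtered` and `merge_spaces_filter` are made-up names for illustration, and none of them exist in myhtml. It also leaves out the other work the real preprocessing function does, such as replacing NUL bytes.

```c
#include <stddef.h>

/* Hypothetical hook type (not a myhtml API): receives one input byte and
 * returns 1 to keep the byte written to *out, or 0 to drop it. */
typedef int (*text_filter_f)(unsigned char in, unsigned char *out, void *ctx);

/* Copies src into dst through an optional filter.
 * Assumes dst can hold at least length + 1 bytes. Returns the bytes written. */
static size_t append_filtered(char *dst, const char *src, size_t length,
                              text_filter_f filter, void *ctx)
{
    size_t out_len = 0;

    for (size_t i = 0; i < length; i++) {
        unsigned char c = (unsigned char)src[i];
        unsigned char out = c;

        /* With no filter installed, every byte is copied unchanged. */
        if (filter == NULL || filter(c, &out, ctx))
            dst[out_len++] = (char)out;
    }

    dst[out_len] = '\0';
    return out_len;
}

/* Example filter: merges each run of space, CR (0x0D) and LF (0x0A) into a
 * single space. ctx points to an int that remembers whether the previous
 * byte was whitespace. */
static int merge_spaces_filter(unsigned char in, unsigned char *out, void *ctx)
{
    int *prev_ws = (int *)ctx;

    if (in == ' ' || in == 0x0D || in == 0x0A) {
        int keep = !*prev_ws;   /* keep only the first byte of a run */
        *prev_ws = 1;
        *out = ' ';
        return keep;
    }

    *prev_ws = 0;
    *out = in;
    return 1;
}
```

With an `int prev_ws = 0;` passed as `ctx`, filtering the six bytes of `"a \r\n b"` through `merge_spaces_filter` yields `"a b"`. Passing `NULL` as the filter copies the input unchanged, so the default behavior stays the same for users who do not install a hook.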

The point is that CSS can change how the content of a tag is displayed (for example, the white-space property can preserve spaces and line breaks). At parsing time, we do not know what the user needs. In this case, you are asking for fairly specific functionality at parsing time, that is, for a change in the behavior of the parser.
I'll think about how it can be done painlessly.
Nevertheless, I would solve this problem one level above, not at the moment of parsing.
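
Handling this one level above the parser could look roughly like the sketch below. It assumes the application already holds a writable, NUL-terminated copy of a text node's bytes. `collapse_ws_inplace` and its `preserve` flag are hypothetical names, not part of myhtml. The flag stands in for whatever the application knows about styling, for instance text that should be treated as preformatted. The function works in place, so it needs no extra allocation.

```c
#include <stddef.h>
#include <stdbool.h>

/*
 * Hypothetical application-level helper (not a myhtml function).
 * Collapses each run of space, tab, LF, FF and CR into one space, in place.
 * When preserve is true, the buffer is left untouched.
 * Assumes text[length] is writable, as it is for a NUL-terminated buffer.
 * Returns the new length.
 */
static size_t collapse_ws_inplace(char *text, size_t length, bool preserve)
{
    if (preserve)
        return length;

    size_t out = 0;
    bool in_run = false;

    for (size_t i = 0; i < length; i++) {
        unsigned char c = (unsigned char)text[i];
        bool is_ws = (c == ' ' || c == '\t' || c == '\n' ||
                      c == '\f' || c == '\r');

        if (is_ws) {
            /* Emit one space for the first byte of a run, skip the rest. */
            if (!in_run)
                text[out++] = ' ';
            in_run = true;
        } else {
            text[out++] = (char)c;
            in_run = false;
        }
    }

    text[out] = '\0';
    return out;
}
```

For a buffer holding `"Hello \r\n   world\n"`, calling the function with `preserve` set to `false` leaves `"Hello world "` in the buffer and returns 12. Because the write index never runs ahead of the read index, a single pass over the existing buffer is enough.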

I'm not very familiar with CSS yet. I didn't know that it can change how spaces and carriage returns are rendered.
In that case, you are right, it is better to handle that at a higher level!
Thanks