Multiple HTML space parsing

Question

fariouche opened this issue 7 years ago · 4 comments

Hi,

I'm currently evaluating myhtml, and I like it.
I would like to use it to extract text from HTML. However, the HTML specification requires that the spaces are merged, and carriage return and line feed replaced by spaces.

Here is a proposed patch for that. It is only a proposal: it most likely needs more work so that the feature can be enabled or disabled.
I know that it can be done in my application, but doing it here will be much faster.

```diff
--- source/myhtml/mystring.c.orig 2018-01-04 15:30:36.000000000 +0100
+++ source/myhtml/mystring.c 2018-01-04 15:55:41.000000000 +0100
@@ -27,15 +27,18 @@
 
     unsigned char *data = (unsigned char*)str->data;
     const unsigned char *u_buff = (const unsigned char*)buff;
 
+    int num_spc = 0;
     /* 0x0D == \r */
     /* 0x0A == \n */
 
     for (size_t i = 0; i < length; i++)
     {
         if(u_buff[i] == 0x0D) {
-            data[str->length] = 0x0A;
+            data[str->length] = ' ';
+            num_spc = 1;
+            str->length++;
 
             if((i + 1) < length) {
                 if(u_buff[(i + 1)] == 0x0A)
                     i++;
@@ -49,6 +52,13 @@
                 return str->length;
             }
         }
+        else if(u_buff[i] == 0x0A || u_buff[i] == ' ') {
+            if(num_spc == 0) {
+                data[str->length] = ' ';
+                str->length++;
+                num_spc = 1;
+            }
+        }
         else if(u_buff[i] == 0x00 && emit_null_chars == false)
         {
             mycore_string_realloc(str, (str->size + 5));
@@ -57,12 +67,16 @@
             // Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)
             data[str->length] = 0xEF; str->length++;
             data[str->length] = 0xBF; str->length++;
-            data[str->length] = 0xBD;
+            data[str->length] = 0xBD; str->length++;
         }
         else
+
+        {
             data[str->length] = u_buff[i];
+            num_spc = 0;
+            str->length++;
+        }
 
-        str->length++;
     }
 
     str->data[str->length] = '\0';
```

Answer 1 · 2018-01-04T18:32:13.000Z

Hi!

Rendering HTML and parsing it are two different things. Can you show me where the specification says this: "HTML specification requires that the spaces are merged, and carriage return and line feed replaced by spaces."? I am looking at the specification right now, and it seems to me that this is about rendering.

Thanks!

Answer 2 · 2018-01-04T19:41:18.000Z

Yes, it applies to rendering the HTML, not to parsing it. Sorry, I was not specific enough.
Anyway, the idea here is just to avoid a second pass over all the text, with the extra malloc calls and so on. Since myHtml is performance-oriented, I felt that this was the right place to do it.

Or maybe we can override the function with a callback?

Answer 3 · 2018-01-04T19:54:08.000Z

The point is that CSS can change the behavior of tags. At parse time, we do not know what the user needs. You are asking for fairly specific functionality at parse time, that is, for a change in the parser's behavior.
I'll think about how this can be done painlessly.
Nevertheless, I would solve this problem one level above, not at the moment of parsing.
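
As a rough illustration of handling this one level above the parser, here is a minimal sketch that collapses whitespace in place on text that has already been extracted from a parsed node. The names `collapse_whitespace` and `is_html_space` are hypothetical helpers and not part of myhtml. The sketch assumes the buffer is writable and NUL-terminated, and it ignores CSS entirely, so the caller would have to skip it for elements styled with something like `white-space: pre`.

```c
#include <stdbool.h>
#include <stddef.h>

/* HTML "ASCII whitespace": tab, line feed, form feed, carriage return, space. */
static bool is_html_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/*
 * Hypothetical helper (not a myhtml function): collapse every run of
 * whitespace in text[0..length) into a single space, in place.
 * The buffer must have room for a terminating NUL at text[length],
 * which is the case for any NUL-terminated string of that length.
 * Returns the new length.
 */
size_t collapse_whitespace(char *text, size_t length)
{
    size_t out = 0;
    bool in_space = false;

    for (size_t i = 0; i < length; i++) {
        unsigned char c = (unsigned char)text[i];

        if (is_html_space(c)) {
            /* Emit one space for the whole run, then skip the rest. */
            if (!in_space) {
                text[out++] = ' ';
                in_space = true;
            }
        } else {
            text[out++] = (char)c;
            in_space = false;
        }
    }

    text[out] = '\0';
    return out;
}
```

Because the output never grows, no allocation is needed and the pass is a single linear scan, so the cost of doing this outside the parser stays small.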

Answer 4 · 2018-01-05T14:29:32.000Z

I'm not very familiar with CSS yet. I didn't know that it can change how spaces and carriage returns are rendered.
In that case, you are right, it is better to handle that at a higher level!
Thanks