Parsing breaks after `<script>` or `<style>` block, followed by an entity (`&blah;`)

Question

KillyMXI opened this issue 2 years ago · 6 comments

Input:

import { parseDocument } from 'htmlparser2';

const document = parseDocument(
  '<style>a{}</style>&apos;<br/>',
  { decodeEntities: true }
);

console.log(document);

Observed output:

<ref *1> Document {
  type: 'root',
  parent: null,
  prev: null,
  next: null,
  startIndex: null,
  endIndex: null,
  children: [
    Element {
      type: 'style',
      parent: [Circular *1],
      prev: null,
      next: [Text],
      startIndex: null,
      endIndex: null,
      children: [Array],
      name: 'style',
      attribs: {}
    },
    Text {
      type: 'text',
      parent: [Circular *1],
      prev: [Element],
      next: null,
      startIndex: null,
      endIndex: null,
      data: "'<br/>"
    }
  ]
}

Expected: the Text node contains only "'" and is followed by an Element of type "tag" named "br".

When the input is changed to `<style>a{}</style>\'<br/>` or `<style>a{}</style><br/>&apos;<br/>`, it works as expected.

When `decodeEntities` is set to `false`, it also works as expected.

Version 6.1.0 is the last one that works as expected - it was broken in version 7.0.0.

First reported by @galenhuntington in html-to-text/node-html-to-text#285

Answer 1 · 2023-03-21T15:50:34.000Z

tokenize("<style>a{}</style>&apos;<br/>")

[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "ontext", // just text
    24,
    29,
  ],
  [
    "onend",
  ],
]

tokenize("<style>a{}</style><br/>&apos;<br/>")

[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "onopentagname",
    19,
    21,
  ],
  [
    "onselfclosingtag",
    22,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "onopentagname", // tag, as expected
    30,
    32,
  ],
  [
    "onselfclosingtag",
    33,
  ],
  [
    "onend",
  ],
]

So the issue is in the Tokenizer.

I stepped through it:

- While the Tokenizer has state = State.InSpecialTag (24), it also has baseState = State.InSpecialTag (24).
- When the special tag ends, state is reset to State.Text (1), but baseState is left unchanged.
- The named entity processing that follows doesn't touch baseState, but when it finishes it resets state to this stale baseState.

Not sure if this is the cause, but it looks suspicious.
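
To make the suspected mechanism easier to reason about, here is a toy model of it. This is a sketch under stated assumptions and not htmlparser2's code: `toyTokenize`, its token names, and the `syncBaseState` switch are hypothetical, and only the `state` / `baseState` interplay follows the steps above. It also assumes that an ordinary tag resets `baseState`, which would explain why an intervening `<br/>` hides the problem.

```typescript
// Toy model only: two states, string scanning, no attributes or comments.
type State = "Text" | "InSpecialTag";
type Token = [string, string];

function toyTokenize(input: string, syncBaseState: boolean): Token[] {
  const tokens: Token[] = [];
  let state: State = "Text";
  let baseState: State = "Text";
  let i = 0;

  while (i < input.length) {
    if (state === "InSpecialTag") {
      // Everything up to "</style>" is raw text.
      const close = input.indexOf("</style>", i);
      const end = close === -1 ? input.length : close;
      if (end > i) tokens.push(["text", input.slice(i, end)]);
      if (close === -1) break;
      tokens.push(["closetag", "style"]);
      state = "Text";
      // The step in question: is baseState reset together with state?
      if (syncBaseState) baseState = "Text";
      i = close + "</style>".length;
    } else if (input.startsWith("<style>", i)) {
      tokens.push(["opentag", "style"]);
      state = baseState = "InSpecialTag";
      i += "<style>".length;
    } else if (input[i] === "<" && input.indexOf(">", i) !== -1) {
      // Ordinary tag (closing tags other than </style> are not modelled).
      const gt = input.indexOf(">", i);
      tokens.push(["opentag", input.slice(i + 1, gt).replace(/\/$/, "")]);
      baseState = "Text"; // assumption: ordinary tags do reset baseState
      i = gt + 1;
    } else if (input[i] === "&" && input.indexOf(";", i) !== -1) {
      const semi = input.indexOf(";", i);
      tokens.push(["entity", input.slice(i + 1, semi)]);
      state = baseState; // after an entity, return to whatever baseState says
      i = semi + 1;
    } else {
      // Plain text up to the next "<" or "&".
      let j = i + 1;
      while (j < input.length && input[j] !== "<" && input[j] !== "&") j++;
      tokens.push(["text", input.slice(i, j)]);
      i = j;
    }
  }
  return tokens;
}

const stale = toyTokenize("<style>a{}</style>&apos;<br/>", false);
const synced = toyTokenize("<style>a{}</style>&apos;<br/>", true);
console.log(stale[stale.length - 1]);   // [ 'text', '<br/>' ]
console.log(synced[synced.length - 1]); // [ 'opentag', 'br' ]
```

With the stale `baseState`, the entity drops the toy tokenizer back into the special-tag state and the trailing `<br/>` comes out as text, which matches the shape of the first token dump. That the toy reproduces the symptom makes the hypothesis plausible, but it does not prove that the real Tokenizer fails for the same reason.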

Answer 2 · 2023-03-22T15:54:41.000Z

--- a/src/Tokenizer.ts
+++ b/src/Tokenizer.ts
@@ -454,7 +454,8 @@ export default class Tokenizer {
     private stateAfterClosingTagName(c: number): void {
         // Skip everything until ">"
         if (c === CharCodes.Gt || this.fastForwardTo(CharCodes.Gt)) {
             this.state = State.Text;
+            this.baseState = State.Text;
             this.sectionStart = this.index + 1;
         }
     }

Tokenizer output for `<style>a{}</style>&apos;<br/>` with this patch applied:

[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag", // closed style tag
    12,
    17,
  ],
  [
    "ontextentity", // entity
    39,
  ],
  [
    "onopentagname", // following tag parsed properly
    25,
    27,
  ],
  [
    "onselfclosingtag",
    28,
  ],
  [
    "onend",
  ],
]

All existing tests still pass.

This fix seems similar to how baseState is reset for self-closing tags. But I'm not sure I understand the code well enough to rule out further edge cases. I'm also not sure where to put the unit test for this.

Answer 3 · 2023-03-22T16:01:32.000Z

Thanks for the report, and awesome job figuring this one out!

Unit tests would go into https://github.com/fb55/htmlparser2/blob/master/src/Tokenizer.spec.ts, or the events test file. Run jest once, and you'll have the snapshots needed to avoid future issues.

Answer 4 · 2023-03-22T16:04:32.000Z

I mean, locating the spec file is easy; describing the test requires more effort :)

Answer 5 · 2023-03-22T16:06:40.000Z

Ok, decided on the description.

Answer 6 · 2023-03-22T23:34:00.000Z

Fixed in #1460.