fb55/htmlparser2

Parsing breaks after `<script>` or `<style>` block, followed by an entity (`&blah;`)

KillyMXI opened this issue · 6 comments

Input:

import { parseDocument } from 'htmlparser2';

const document = parseDocument(
  '<style>a{}</style>&apos;<br/>',
  { decodeEntities: true }
);

console.log(document);

Observed output:

<ref *1> Document {
  type: 'root',
  parent: null,
  prev: null,
  next: null,
  startIndex: null,
  endIndex: null,
  children: [
    Element {
      type: 'style',
      parent: [Circular *1],
      prev: null,
      next: [Text],
      startIndex: null,
      endIndex: null,
      children: [Array],
      name: 'style',
      attribs: {}
    },
    Text {
      type: 'text',
      parent: [Circular *1],
      prev: [Element],
      next: null,
      startIndex: null,
      endIndex: null,
      data: "'<br/>"
    }
  ]
}

Expected: Text node contains "'", it is followed by an Element of type "tag" named "br".

When changed to <style>a{}</style>\'<br/> or <style>a{}</style><br/>&apos;<br/> - it works as expected.

When decodeEntities is set to false - it works as expected.

Version 6.1.0 is the last one that works as expected - it was broken in version 7.0.0.

First reported by @galenhuntington in html-to-text/node-html-to-text#285

tokenize("<style>a{}</style>&apos;<br/>")
Expand
[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "ontext", // just text
    24,
    29,
  ],
  [
    "onend",
  ],
]
tokenize("<style>a{}</style><br/>&apos;<br/>")
Expand
[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "onopentagname",
    19,
    21,
  ],
  [
    "onselfclosingtag",
    22,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "onopentagname", // tag, as expected
    30,
    32,
  ],
  [
    "onselfclosingtag",
    33,
  ],
  [
    "onend",
  ],
]

So the issue is in Tokenizer.

I tried to step through:

  • while Tokenizer has state = State.InSpecialTag (24), it also has baseState = State.InSpecialTag (24);
  • when the special tag ends, state is reset to State.Text (1), but baseState is left unchanged;
  • following named entity processing doesn't affect baseState but does reset the state to this erroneous baseState in the end;

Not sure if this is the cause but it looks suspicious.

--- a/src/Tokenizer.ts
+++ b/src/Tokenizer.ts
@@ -454,7 +454,8 @@ export default class Tokenizer {
     private stateAfterClosingTagName(c: number): void {
         // Skip everything until ">"
         if (c === CharCodes.Gt || this.fastForwardTo(CharCodes.Gt)) {
             this.state = State.Text;
+            this.baseState = State.Text;
             this.sectionStart = this.index + 1;
         }
     }
[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag", // closed style tag
    12,
    17,
  ],
  [
    "ontextentity", // entity
    39,
  ],
  [
    "onopentagname", // following tag parsed properly
    25,
    27,
  ],
  [
    "onselfclosingtag",
    28,
  ],
  [
    "onend",
  ],
]

All existing tests still passing.

This fix seems to be similar how baseState is reset for self-closing tags. But I'm not sure I understand the code correctly to be sure there are no more edge cases. I'm also not sure where to put the unit test for this.

fb55 commented

Thanks for the report, and awesome job figuring this one out!

Unit tests would go into https://github.com/fb55/htmlparser2/blob/master/src/Tokenizer.spec.ts, or the events test file. Run jest once, and you'll have the snapshots needed to avoid future issues.

I mean, locating the spec file is easy, describing the test requires more effort :)

Ok, decided on the description.

fb55 commented

Fixed in #1460.