Parsing breaks after `<script>` or `<style>` block, followed by an entity (`&blah;`)
KillyMXI opened this issue · 6 comments
Input:
import { parseDocument } from 'htmlparser2';
const document = parseDocument(
'<style>a{}</style>'<br/>',
{ decodeEntities: true }
);
console.log(document);
Observed output:
<ref *1> Document {
type: 'root',
parent: null,
prev: null,
next: null,
startIndex: null,
endIndex: null,
children: [
Element {
type: 'style',
parent: [Circular *1],
prev: null,
next: [Text],
startIndex: null,
endIndex: null,
children: [Array],
name: 'style',
attribs: {}
},
Text {
type: 'text',
parent: [Circular *1],
prev: [Element],
next: null,
startIndex: null,
endIndex: null,
data: "'<br/>"
}
]
}
Expected: Text node contains "'", it is followed by an Element of type "tag" named "br".
When changed to <style>a{}</style>\'<br/>
or <style>a{}</style><br/>'<br/>
- it works as expected.
When decodeEntities
is set to false
- it works as expected.
Version 6.1.0 is the last one that works as expected - it was broken in version 7.0.0.
First reported by @galenhuntington in html-to-text/node-html-to-text#285
tokenize("<style>a{}</style>'<br/>")
Expand
[
[
"onopentagname",
1,
6,
],
[
"onopentagend",
6,
],
[
"ontext",
7,
10,
],
[
"onclosetag",
12,
17,
],
[
"ontextentity",
39,
],
[
"ontext", // just text
24,
29,
],
[
"onend",
],
]
tokenize("<style>a{}</style><br/>'<br/>")
Expand
[
[
"onopentagname",
1,
6,
],
[
"onopentagend",
6,
],
[
"ontext",
7,
10,
],
[
"onclosetag",
12,
17,
],
[
"onopentagname",
19,
21,
],
[
"onselfclosingtag",
22,
],
[
"ontextentity",
39,
],
[
"onopentagname", // tag, as expected
30,
32,
],
[
"onselfclosingtag",
33,
],
[
"onend",
],
]
So the issue is in Tokenizer.
I tried to step through:
- while Tokenizer has state = State.InSpecialTag (24), it also has baseState = State.InSpecialTag (24);
- when the special tag ends, state is reset to State.Text (1), but baseState is left unchanged;
- following named entity processing doesn't affect baseState but does reset the state to this erroneous baseState in the end;
Not sure if this is the cause but it looks suspicious.
--- a/src/Tokenizer.ts
+++ b/src/Tokenizer.ts
@@ -454,7 +454,8 @@ export default class Tokenizer {
private stateAfterClosingTagName(c: number): void {
// Skip everything until ">"
if (c === CharCodes.Gt || this.fastForwardTo(CharCodes.Gt)) {
this.state = State.Text;
+ this.baseState = State.Text;
this.sectionStart = this.index + 1;
}
}
[
[
"onopentagname",
1,
6,
],
[
"onopentagend",
6,
],
[
"ontext",
7,
10,
],
[
"onclosetag", // closed style tag
12,
17,
],
[
"ontextentity", // entity
39,
],
[
"onopentagname", // following tag parsed properly
25,
27,
],
[
"onselfclosingtag",
28,
],
[
"onend",
],
]
All existing tests still passing.
This fix seems to be similar how baseState is reset for self-closing tags. But I'm not sure I understand the code correctly to be sure there are no more edge cases. I'm also not sure where to put the unit test for this.
Thanks for the report, and awesome job figuring this one out!
Unit tests would go into https://github.com/fb55/htmlparser2/blob/master/src/Tokenizer.spec.ts, or the events test file. Run jest once, and you'll have the snapshots needed to avoid future issues.
I mean, locating the spec file is easy, describing the test requires more effort :)
Ok, decided on the description.