<title> tag content will never decode html entities
cheeseandcereal opened this issue · 4 comments
0189e56 introduced a bug where html entities inside <title> tags are not decoded, even when the parser is set to decode entities.
Example test code:
// test.js
const htmlparser2 = require("./lib/index");
const parser = new htmlparser2.Parser({
    ontext(text) {
        console.log("-->", text);
    },
}, { decodeEntities: true });
// The title text contains HTML entities that should be decoded.
parser.write("<title>my&quot;title&quot;");
parser.end();
Before 0189e56:
$ node test.js
--> my
--> "
--> title
--> "
After 0189e56:
$ node test.js
--> my&quot;title&quot;
This appears to be the result of trying to fix #482; however, HTML entity decoding should still occur inside of title tags (even if additional HTML tags before the closing </title> should not be parsed).
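To make the expected behaviour concrete, here is a minimal standalone sketch. It is not htmlparser2's implementation, and the names readTitleText, decodeEntities and NAMED_ENTITIES are hypothetical helpers invented for illustration. It assumes only a small table of named entities plus numeric references: everything up to the closing </title> is kept as text, nested tags are left untouched, and entities are decoded.

```javascript
// Hypothetical helpers for illustration only; not part of htmlparser2.
// A deliberately tiny entity table (assumption: the real table is far larger).
const NAMED_ENTITIES = new Map([
    ["quot", '"'],
    ["amp", "&"],
    ["lt", "<"],
    ["gt", ">"],
    ["apos", "'"],
]);

// Decode numeric references and the few named references listed above.
// Unknown or out-of-range references are left exactly as written.
function decodeEntities(text) {
    return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
        if (body[0] === "#") {
            const isHex = body[1] === "x" || body[1] === "X";
            const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
            return code > 0x10ffff ? match : String.fromCodePoint(code);
        }
        return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
    });
}

// Return the decoded text of the first <title>, or null if there is none.
// Tags inside the title are NOT parsed; they stay in the text as-is.
function readTitleText(html) {
    const open = /<title\b[^>]*>/i.exec(html);
    if (!open) return null;
    const rest = html.slice(open.index + open[0].length);
    const close = rest.search(/<\/title\s*>/i);
    const raw = close === -1 ? rest : rest.slice(0, close);
    return decodeEntities(raw);
}

console.log(readTitleText("<title>my&quot;title&quot;"));
console.log(readTitleText("<title>a <b>&amp;</b> b</title><p>ignored</p>"));
```

The first call corresponds to the input from test.js, with the quotes decoded; the second shows that the <b> tags survive as plain text while &amp; is still decoded.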
Thanks for the report! This is definitely an issue.
@billneff79 As the author of the change in question, do you have bandwidth to address this?
I'm not quite sure this is a bug. The contents of a <title> tag are always a text node and thus are never HTML encoded in the first place per the HTML specification. Something that looks like an HTML encoding, e.g. &quot;, will always be the literal string &quot; under the HTML specification, and never an encoded ".
I guess the question is, what is the purpose of the decodeEntities flag? If it is to decode things so that they look how a user perceives them in their browser, then the current implementation is correct: <title>my&quot;title&quot;</title> as a tag would appear in a browser tab to the user as the string my&quot;title&quot;. If it is to always decode the entities inside of a tag, regardless of whether that tag can hold HTML encoded content, then you also have a bug with your <script> and <style> tags, as those likely aren't being decoded either, but nor should they be.
I'm continuing to noodle and play with this a bit more - I think the submitter of the issue is correct. Out of necessity the browser has to protect against a closing </title> tag in the text content, and thus does encode/decode &gt; and &lt;, and also encodes & to &amp;. I was interpreting the spec incorrectly.
I'll try to get a PR together to fix the decoding issue. Thanks for reporting @cheeseandcereal
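The distinction this comment arrives at can be sketched as a small lookup. This is an illustrative reading of the HTML specification's element kinds (raw text versus escapable raw text), not htmlparser2 code; contentModel and the two sets are hypothetical names.

```javascript
// Hypothetical classification for illustration; not htmlparser2's API.
// Raw text elements: contents are neither parsed for tags nor entity-decoded.
const RAW_TEXT = new Set(["script", "style"]);
// Escapable raw text elements: no nested tags, but entities are decoded.
const ESCAPABLE_RAW_TEXT = new Set(["title", "textarea"]);

function contentModel(tagName) {
    const name = tagName.toLowerCase();
    if (RAW_TEXT.has(name)) {
        return { parsesTags: false, decodesEntities: false };
    }
    if (ESCAPABLE_RAW_TEXT.has(name)) {
        return { parsesTags: false, decodesEntities: true };
    }
    // Ordinary elements: both nested tags and entities are processed.
    return { parsesTags: true, decodesEntities: true };
}

console.log(contentModel("title"));
console.log(contentModel("script"));
```

Under this reading, <title> sits between <script>/<style> and ordinary elements: nested tags are ignored, which is what the fix for #482 was after, while entity decoding still applies.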
$ curl -s "https://www.google.com/search?q=owari+no+seraph+shinoa" -A "kana/2.0 (node-fetch) like Twitterbot/1.0" | grep -o "<title>[^<]*"
<title>owari no seraph shinoa - Поиск в Google