fb55/htmlparser2

<title> tag content will never decode html entities

cheeseandcereal opened this issue · 4 comments

0189e56 introduced a bug where html entities inside <title> tags are not decoded, even when the parser is set to decode entities.

Example test code:

// test.js
const htmlparser2 = require("./lib/index");
const parser = new htmlparser2.Parser({
    ontext(text) {
        console.log("-->", text);
    },
}, { decodeEntities: true });
parser.write("<title>my&quot;title&quot;");
parser.end();

Before 0189e56:

$ node test.js
--> my
--> "
--> title
--> "

After 0189e56:

$ node test.js
--> my&quot;title&quot;

This appears to be the result of trying to fix #482. However, HTML entity decoding should still occur inside title tags (even if additional HTML tags before the closing </title> should not be parsed).

fb55 commented

Thanks for the report! This is definitely an issue.

@billneff79 As the author of the change in question, do you have bandwidth to address this?

I'm not quite sure this is a bug. The contents of a <title> tag are always a text node and thus are never HTML encoded in the first place per the HTML specification. Something that looks like an HTML encoding, e.g. &quot;, will always be the literal string &quot; under the HTML specification, and never an encoded ".

I guess the question is: what is the purpose of the decodeEntities flag? If it is to decode things so that they look how a user perceives them in their browser, then the current implementation is correct: <title>my&quot;title&quot;</title> as a tag would appear in a browser tab to the user as the string my&quot;title&quot;. If it is to always decode the entities inside of a tag, regardless of whether that tag can hold HTML-encoded content, then you also have a bug with your <script> and <style> tags, as those likely aren't being decoded either, nor should they be.

I'm continuing to noodle and play with this a bit more - I think the submitter of the issue is correct. Out of necessity the browser has to protect against a closing </title> tag in the text content, and thus does encode/decode &gt; and &lt;, and also encodes & to &amp;. I was interpreting the spec incorrectly.

I'll try to get a PR together to fix the decoding issue. Thanks for reporting, @cheeseandcereal.
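To make the conclusion above concrete: <title> content is treated as text in which character references are decoded but markup is not parsed. Below is a minimal sketch of that behavior, not the htmlparser2 implementation. The helper name decodeRcdataText and the NAMED table are hypothetical, and only five named entities plus numeric references are handled as an assumption to keep the example short.

```javascript
// Hypothetical helper, not part of htmlparser2.
// Decodes a small subset of character references in <title>-style text
// while leaving anything that looks like a tag untouched.
const NAMED = { quot: '"', amp: "&", lt: "<", gt: ">", apos: "'" };

function decodeRcdataText(raw) {
    // Matches &#xHEX;, &#DEC;, or &name; references.
    const pattern = /&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z]+));/g;
    return raw.replace(pattern, (match, hex, dec, name) => {
        if (hex !== undefined || dec !== undefined) {
            const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
            // Leave out-of-range references as-is instead of throwing.
            return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
        }
        // Unknown names are left as-is (this sketch knows only five).
        return Object.prototype.hasOwnProperty.call(NAMED, name) ? NAMED[name] : match;
    });
}

console.log(decodeRcdataText("my&quot;title&quot;"));
console.log(decodeRcdataText("&#1055;&#1086;&#1080;&#1089;&#1082; &#1074; Google"));
```

With this sketch, the title text from the original report decodes to my"title", and numeric references such as &#1055; decode to the corresponding Unicode characters. A real parser would use a complete entity table and the tokenizer's own state handling rather than a regular expression.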

$ curl -s "https://www.google.com/search?q=owari+no+seraph+shinoa" -A "kana/2.0 (node-fetch) like Twitterbot/1.0" | grep -o "<title>[^<]*"
<title>owari no seraph shinoa - &#1055;&#1086;&#1080;&#1089;&#1082; &#1074; Google