`.getOuterHTML` produces invalid HTML when attributes contain double-quotes
laurenkt opened this issue · 3 comments
Example:
var DomUtils = require('domutils')
var dom = {
type: "tag",
name: "div",
next: null,
prev: null,
parent: null,
attribs: {
'foo': 'bar with "quotes"'
}
}
console.log(DomUtils.getOuterHTML(dom))
Outputs: <div foo="bar with "quotes""></div>
Which is invalid HTML (and a code injection risk).
Here's an example that shows a more complete example of the problem when using in conjunction with htmlparser2:
var DomUtils = require('domutils')
var htmlparser2 = require('htmlparser2')
var handler = new htmlparser2.DomHandler(function(err, dom) {
console.log(DomUtils.getOuterHTML(dom))
})
var parser = new htmlparser2.Parser(handler)
parser.write(`
<!DOCTYPE html>
<title>Example</title>
<p style='font-family: "Times New Roman"'>Test</p>
`)
parser.done()
Which outputs:
<!DOCTYPE html>
<title>Example</title>
<p style="font-family: "Times New Roman"">Test</p>
According to the HTML5 specification – and the W3 validator – this is valid input code, but when processed by these two tools with their default options, malformed HTML is produced.
Workaround:
getOuterHTML
is to be implemented from cheeriojs/dom-serializer and allows options to be passed in. {decodeEntities: true}
will encode the quotes. This works for my case but it seems like if this is a necessary step to produce valid HTML it should be the default behaviour – or at the very least the default should be encoding what entities are necessary to avoid producing dangerous and malformed data-structures.
getOuterHTML
(and getInnerHTML
) should both be considered to be deprecated in favour of dom-serializer. Thanks for opening this!
getOuterHTML
is now using dom-serializer
, which should have fixed this issue.
Is there any API to also set the innerHTML?