fb55/domutils

`.getOuterHTML` produces invalid HTML when attributes contain double-quotes

laurenkt opened this issue · 3 comments

Example:

var DomUtils = require('domutils')

var dom = {
	type: "tag",
	name: "div",
	next: null,
	prev: null,
	parent: null,
	attribs: {
		'foo': 'bar with "quotes"'
	}
}

console.log(DomUtils.getOuterHTML(dom))

Outputs: <div foo="bar with "quotes""></div>

Which is invalid HTML (and a code injection risk).

Here's an example that shows a more complete example of the problem when using in conjunction with htmlparser2:

var DomUtils = require('domutils')
var htmlparser2 = require('htmlparser2')

var handler = new htmlparser2.DomHandler(function(err, dom) {
	console.log(DomUtils.getOuterHTML(dom))
})
var parser = new htmlparser2.Parser(handler)
parser.write(`
<!DOCTYPE html>
<title>Example</title>
<p style='font-family: "Times New Roman"'>Test</p>
`)
parser.done()

Which outputs:

<!DOCTYPE html>
<title>Example</title>
<p style="font-family: "Times New Roman"">Test</p>

According to the HTML5 specification – and the W3 validator – this is valid input code, but when processed by these two tools with their default options, malformed HTML is produced.

Workaround:

getOuterHTML is to be implemented from cheeriojs/dom-serializer and allows options to be passed in. {decodeEntities: true} will encode the quotes. This works for my case but it seems like if this is a necessary step to produce valid HTML it should be the default behaviour – or at the very least the default should be encoding what entities are necessary to avoid producing dangerous and malformed data-structures.

fb55 commented

getOuterHTML (and getInnerHTML) should both be considered to be deprecated in favour of dom-serializer. Thanks for opening this!

fb55 commented

getOuterHTML is now using dom-serializer, which should have fixed this issue.

Is there any API to also set the innerHTML?