fb55/htmlparser2

Carefully constructed markup sneaks tags through as "text"

boutell opened this issue · 6 comments

This code:

<<img src="javascript:evil"/>img src="javascript:evil"/>

Results in the following sequence of onopentag/ontext/onclosetag events:

text: < open: img (with the expected src attribute) close: img text: img src="javascript:evil"/>

Since the sanitize-html module trusts "text" coming from htmlparser2, and outputs it without further escaping (because htmlparser2 does not decode entities in text before delivering it), this results in an XSS attack vector if sanitize-html ignores the img tag (according to user-configured filter rules) but passes the text intact, as it must do to keep any text in documents.

I have verified that the bug still exists as of version 3.7.3.

👍

fb55 commented

Sorry for the late response.

@boutell You can enable entity decoding using decodeEntities: true. Anyway, that looks like a bug, maybe switching to the tokenizer of high5 fixes it. Won't be fixed anytime soon though.

Thanks for the update. A pity it won't be fixed soon, but hey, we're not
paying you to fix it.

On Mon, Oct 20, 2014 at 2:44 PM, Felix Böhm notifications@github.com
wrote:

Sorry for the late response.

@boutell https://github.com/boutell You can enable entity decoding
using decodeEntities: true. Anyway, that looks like a bug, maybe
switching to the tokenizer of high5 fixes it. Won't be fixed anytime soon
though.


Reply to this email directly or view it on GitHub
#105 (comment).

*THOMAS BOUTELL, *DEV & OPS
P'UNK AVENUE | (215) 755-1330 | punkave.com

fb55 commented

As this seems to be confusing for a lot of people: This is not a vulnerability, but instead a bug in @boutell's module. The behavior is in-line with the HTML spec (I wasn't sure about it in my previous comment).

Entities aren't decoded by default, only not to break backwards compatibility, but will be in the next major release (which will mainly consist of #114, I only need to take a day and add positional support to high5). It is recommended to always decoded entities, then use eg. entities to encode them again. Of course this has a performance penalty, but it eliminates this risk.

For now, I'll add a note to the wiki page recommending to always enable decodeEntities, which is pretty much everything that can be done here.

OK, version 1.5.1 of sanitize-html uses decodeEntities: true and passes its filter evasion tests without the need for recursive invocation. Thanks.

If the behavior with decodeEntities: false is inherently unsafe I wonder if it should be offered at all in the next release. But maybe it has an application I'm not seeing.

Always depending on encode() is not a good opinion, because it also encode CJK chars into entities, which is hard to do string operations (chars and length are changed)...