voku/anti-xss

False positive for `foo="<span class="bar">baz</span>"`

Closed this issue · 4 comments

Input:

<span>
    foo="<span class="bar">baz</span>"
</span>

Expected result (no changes):

<span>
    foo="<span class="bar">baz</span>"
</span>

Actual result:

<span>
    foo="&lt;span class="bar">baz</span>"
</span>

The change is coming from this part:

$str = (string) \preg_replace_callback(
'#<(?!!--|!\[)((?<start>/*\s*)((?<tagName>[\p{L}:]+)(?=[^\p{L}]|$|)|.+)[^\s"\'\p{L}>/=]*[^>]*)(?<closeTag>>)?#iusS', // tags without comments
function ($matches) {
if (
$this->_do_not_close_html_tags !== []
&&
isset($matches['tagName'])
&&
\in_array($matches['tagName'], $this->_do_not_close_html_tags, true)
) {
return $matches[0];
}
return $this->_close_html_callback($matches);
},
$str
);

The part above is changing the code, but the cause seems to be this regex:

while (\preg_match("/[?|&]?[\p{L}0-9_\-\[\]]+\s*=\s*([\"'])(?<attr>[^\1]*?)\\1/u", $strCopy, $matches)) {

It detects foo="..." as an attribute, but actually it is content of the outer span tag.

voku commented

Hi, thanks for the bug report.

I already looked into it and the problem is that we use $regExForHtmlTags = '/<\p{L}+.*+/us'; to decode the HTML attributes, but because we want to decode also HTML attributes in broken tags, I thought it's a good idea to look only for the open tags.

If we replace it with $regExForHtmlTags = '/<\p{L}+.*>/usU';, then it's working, but I need to check what this is changing for all the test cases.

EDIT: with this change e.g. <a title="\'<<>>"> is not detected anymore :-/
https://regex101.com/r/jzMmDh/1 vs. https://regex101.com/r/uxFLyA/1

Maybe something like https://regex101.com/r/loBzsU/1?

voku commented

Thanks for the fix: :)

⇒ fixed in version 4.1.34