rails/rails-html-sanitizer

`strip_tags(input).html_safe? # => false to true` ?

dorianmariecom opened this issue · 8 comments

It seems like strip_tags is escaping the input, shouldn't it be html_safe? then?

I don't like adding .html_safe in my code but seems like strip_tags already escapes the input for strip_tags(input).html_safe should be safe

what do you think?

This seems reasonable to me, but I'd like to hear from @kaspth in case there's a reason this wasn't done before now.

I'd say no, because we've stripped all the HTML tags and thus there's no HTML left to declare safe. And since the tags are gone <%= strip_tags(content) %>, which is really short for <%= html_escape strip_tags(content) %> wouldn't be any different regardless of html_safe? or not.

Unless there's something I'm missing here?

My issue is that if I have an input like "Meet & Crush", I do:

> helper.strip_tags("Meet & Crush")
=> "Meet &amp; Crush"

So it escapes the & correctly, except that if I do:

> helper.strip_tags("Meet & Crush").html_safe?
=> false

So it will be escaped twice:

> helper.send(:h, helper.strip_tags("Meet & Crush"))
=> "Meet &amp;amp; Crush"

Which is what rails does in the views when the string is not html_safe?.

Which is why I think strip_tags(input).html_safe? should be true

Sorry I've taken so long to respond - I've queued this up for a closer look this weekend.

Sorry for the delay. Let me explain my mental models to help readers understand how I'm making the decision about this behavior.

Here's a quick demonstration of the difference between plain text serialization and html-safe plain text serialization:

frag = "<div>hello <!-- comment --><br>bye < then</div>" # note this has broken markup in it - unescaped `<`

# note that the broken markup gets fixed from `<` to `&lt;`
Nokogiri::HTML::DocumentFragment.parse(frag).to_html
# => "<div>hello <!-- comment --><br>bye &lt; then</div>"

# the naïve implementation of "strip tags" is to ask Nokogiri only to give us back the plaintext content of the HTML
# note that this is NOT html-safe because it contains a bare `<` character
Nokogiri::HTML::DocumentFragment.parse(frag).content
# => "hello bye < then"

# these two invocations are equivalent
# note that the `<` is escaped properly, meaning it's html-safe
>> Loofah.fragment(frag).text
=> "hello bye &lt; then"
>> strip_tags(frag)
=> "hello bye &lt; then"

The vulnerability that we're avoiding by using Loofah::DocumentFragment#text instead of Nokogiri::XML::Node#content is demonstrated here:

frag = "<div>&lt;<span>script</span>&gt;alert()&lt;<span>/script</span>&gt;</div>"

# this is why we cannot use Nokogiri::XML::Node#content for sanitization
Nokogiri::HTML::DocumentFragment.parse(frag).content
# => "<script>alert()</script>" # this is an XSS attack

Loofah.fragment(frag).text
# => "&lt;script&gt;alert()&lt;/script&gt;" # html-safe!
strip_tags(frag)
# => "&lt;script&gt;alert()&lt;/script&gt;" # html-safe!

So, what I hope I've shown above is that the content returned by strip_tags is indeed html-safe and the string should be marked as such.

I'll work on fixing this for an upcoming release.

See rails/rails#45218 for a PR that fixes this behavior.

Note that the rails PR is being backported to 7-0-stable. Thanks again for pointing this out!