Add an option to use `to_text` instead of `to_html` to `FullSanitizer`
Earlopain opened this issue · 13 comments
Rails has the strip_tags method, which is great. Howerever for input like <p>a</p><p>b</p> the output is ab which doesn't follow what is displayed in the browser.
Loofah has the function to_text which handles newlines like this. I would like to introduce an option like preserve_whitespace (or similar) to the FullSanitizer to do just that.
Is that something you would consider adding?
@Earlopain this seems like a good idea to me, and I'd consider merging something like this. Do you want to try to implement this in a pull request? I'd be happy to support and mentor you if that's something that would be helpful!
@flavorjones I'd like to give it a shot. Thank you for the mentoring offer, I really appreciate that. I'll see how far I get and if I get stuck anywhere I'll get back to you on that.
I should have time for this at the end of the week by the latest.
OK, I'm going to step back and ask whether your use case could be addressed by just using Loofah's built-in scrubber:
test "ActionView::Helpers::SanitizeHelper#sanitize with a Loofah builtin scrubber" do
input = %(<div>hello</div><p>world</p>)
assert_equal(%(\nhello\n\nworld\n), @subject.sanitize(input, scrubber: :newline_block_elements))
endThat test passes today!
The symbol :newline_block_elements is a shortcut for Loofah::Scrubbers::NewlineBlockElements which is what #to_text uses under the hood.
I'm also curious why you're not just using Loofah directly?
Thanks for the reply. I'm mostly looking to improve the functionality in the rails strip_tags method, which I guess would come one step after this. I came here since rails delegates that functionality to this gem. Thinking about it, I could try to get it into rails itself but I feel that wouldn't be the correct place since the sanitization logic is all here.
I occasionally find myself wanting to scrub html, loosing whitespace was never what I wanted. I know I can use Loofah (though I wasn't aware of the scrubber syntax) but the native rails method just doesn't work for me. I think it would be great if rails could support this behaviour directly.
I occasionally find myself wanting to scrub html, loosing whitespace was never what I wanted
This is the use case I'm curious to know more about. What kind of view is it, what kind of user agent is rendering the output?
I think understanding the use case will be good to make the case upstream to change or add features in SanitizeHelper.
Of course, I'm hoping I'm not getting to long-winded here:
Consider a page where users can write text in the form of markdown, like in Github Issues. There is a bunch of decorative text like the issue title or who wrote the issue, but the only relevant part is what the user provided, like this for example:
# Cool title
Very descriptive textIn html terms this would mean something like this:
<h1 id="cool-title">Cool title</h1><p>Very descriptive text</p>Usually search engines are pretty good about what text they include in their search results but sometimes the text from outside gets included. Something like User X opened this issue last week Add an option to use.... The solution is to provide <meta description="XYZ"> to nudge search engines to what's important but since the source text is markdown I can't embed that directly. I also can't use strip_text since that results in Cool titleVery descriptive text which is missing a space.
Another use case is for web scraping where you consume and store arbitrary HTML from websites. I'd like to display that to users without embeding untrusted input while at least preserving the general structure of the text without making it unreadable.
I realize that these things are probably very specific to me, I Just thought a few others might find a quick solution useful as well. I'm perfectly fine with just using Loofah myself.
@Earlopain Thank you for this context! Very helpful, I get it now.
Let me think about this a bit. I think you're making a good case that the existing behavior of strip_tags is just ... not useful. I'm thinking about how to approach a conversation with the Rails committers about that.
Hi there @flavorjones, did you have a chance to give this some more thought?
@Earlopain Thanks for the ping. This is still on my to-do list. So many Loofah and RHS changes have gone into Rails 7.1 that I've been kicking this can down the road until after 7.1 ships (which should be pretty soon). I've also learned a bit more about how changes like this can be framed/proposed, and I have an idea of how I want to drive that conversation.
Thanks for your patience.
Waiting for after Rails 7.1 is totally fair. Thanks for letting me know
@flavorjones Thanks for sharing the scrubber: :newline_block_elements option in #154 (comment). This is super helpful!
Tagging this on the Rails Conf 2024 hack day project board. If someone wants to think/talk about this that would be great!
I'm looking at writing some tests/PR in actionview for this!