rails/rails-html-sanitizer

The sanitization method changes the tag structure if there is a `<table>` tag inside an `<a>` tag.

naitoh opened this issue · 3 comments

Description

In the sanitize method, if there is <table> tag inside <a> tag, the result will be different than expected.

Steps to Reproduce

$ ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
$ gem list rails-html-sanitizer loofah nokogiri crass

*** LOCAL GEMS ***

rails-html-sanitizer (1.5.0)
loofah (2.20.0)
nokogiri (1.14.2 arm64-darwin
crass (1.0.6)

No problem case

> Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><li>test</li></a>', tags: %w(a li), attributes: %w(href))
=> "<a href=\"https://example.com\"><li>test</li></a>"
> Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><dummy>test</dummy></a>', tags: %w(a dummy), attributes: %w(href))
=> "<a href=\"https://example.com\"><dummy>test</dummy></a>"

Problem case

> Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><table>test</table></a>', tags: %w(a table), attributes: %w(href))
=> "<a href=\"https://example.com\"></a><table>test</table>"

I would expect <a href=\"https://example.com\"><table>test</table></a> response.

But it may be a problem with the behavior of libxml2 (Nokogiri's HTML4 parser)....

> puts Nokogiri::HTML4('<a href="https://example.com"><table>test</table></a>').to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="https://example.com"></a><table>test</table>
</body></html>

🙅

> puts Nokogiri::HTML5('<a href="https://example.com"><table>test</table></a>').to_html
<html><head></head><body><a href="https://example.com">test<table></table></a></body></html>

👌

Hi, thanks for asking this question. As you diagnosed, you're seeing the behavior of the HTML4 parser used by Nokogiri (libxml2).

Nokogiri::HTML4::DocumentFragment.parse('<a href="https://example.com"><table>test</table></a>').to_html
# => "<a href=\"https://example.com\"></a><table>test</table>"
Nokogiri::HTML5::DocumentFragment.parse('<a href="https://example.com"><table>test</table></a>').to_html
# => "<a href=\"https://example.com\">test<table></table></a>"

Nokogiri just wraps the parser, and so there's nothing we can easily do to change this behavior.

Upgrading the full stack of rails-html-sanitizer, Loofah, and Nokogiri to support HTML5 has been a long road. Loofah was just this week released with HTML5 support, and now I'm working on updating rails-html-sanitizer. This is what I've been working on for the last few days.

I just closed (this morning) the previous PR #133 which was an exploration of behavioral differences, because I'm very close to shipping a new PR with the necessary API and code changes. Hang tight!

See #158 for the latest on HTML5 support. On that branch:

Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><table>test</table></a>', tags: %w(a tab
le), attributes: %w(href))
# => "<a href=\"https://example.com\"></a><table>test</table>"
Rails::HTML5::SafeListSanitizer.new.sanitize('<a href="https://example.com"><table>test</table></a>', tags: %w(a ta
ble), attributes: %w(href))
# => "<a href=\"https://example.com\">test<table></table></a>"

Thank you for your comment.
I will wait for the release of a version where Rails::HTML5::SafeListSanitizer can be used.