The sanitization method changes the tag structure if there is a `<table>` tag inside an `<a>` tag.
naitoh opened this issue · 3 comments
Description
With the sanitize method, if there is a <table> tag inside an <a> tag, the result differs from what I expected.
Steps to Reproduce
$ ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
$ gem list rails-html-sanitizer loofah nokogiri crass
*** LOCAL GEMS ***
rails-html-sanitizer (1.5.0)
loofah (2.20.0)
nokogiri (1.14.2 arm64-darwin)
crass (1.0.6)
No problem case
> Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><li>test</li></a>', tags: %w(a li), attributes: %w(href))
=> "<a href=\"https://example.com\"><li>test</li></a>"
> Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><dummy>test</dummy></a>', tags: %w(a dummy), attributes: %w(href))
=> "<a href=\"https://example.com\"><dummy>test</dummy></a>"
Problem case
> Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><table>test</table></a>', tags: %w(a table), attributes: %w(href))
=> "<a href=\"https://example.com\"></a><table>test</table>"
I would expect the result to be <a href=\"https://example.com\"><table>test</table></a>.
However, this may be caused by the behavior of libxml2 (Nokogiri's HTML4 parser):
> puts Nokogiri::HTML4('<a href="https://example.com"><table>test</table></a>').to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="https://example.com"></a><table>test</table>
</body></html>
🙅
> puts Nokogiri::HTML5('<a href="https://example.com"><table>test</table></a>').to_html
<html><head></head><body><a href="https://example.com">test<table></table></a></body></html>
👌
Hi, thanks for asking this question. As you diagnosed, you're seeing the behavior of the HTML4 parser used by Nokogiri (libxml2).
Nokogiri::HTML4::DocumentFragment.parse('<a href="https://example.com"><table>test</table></a>').to_html
# => "<a href=\"https://example.com\"></a><table>test</table>"
Nokogiri::HTML5::DocumentFragment.parse('<a href="https://example.com"><table>test</table></a>').to_html
# => "<a href=\"https://example.com\">test<table></table></a>"
Nokogiri just wraps the parser, and so there's nothing we can easily do to change this behavior.
Upgrading the full stack of rails-html-sanitizer, Loofah, and Nokogiri to support HTML5 has been a long road. Loofah was released with HTML5 support just this week, and I've spent the last few days updating rails-html-sanitizer.
I just closed (this morning) the previous PR #133 which was an exploration of behavioral differences, because I'm very close to shipping a new PR with the necessary API and code changes. Hang tight!
See #158 for the latest on HTML5 support. On that branch:
Rails::Html::SafeListSanitizer.new.sanitize('<a href="https://example.com"><table>test</table></a>', tags: %w(a table), attributes: %w(href))
# => "<a href=\"https://example.com\"></a><table>test</table>"
Rails::HTML5::SafeListSanitizer.new.sanitize('<a href="https://example.com"><table>test</table></a>', tags: %w(a table), attributes: %w(href))
# => "<a href=\"https://example.com\">test<table></table></a>"
Thank you for your comment.
I will wait for the release of a version where Rails::HTML5::SafeListSanitizer can be used.