mozilla/bleach

bug: `\n` when stripping nested tags

drjova opened this issue · 5 comments

Describe the bug

`bleach.clean` with `strip=True` inserts a `\n` into the output where a nested tag that is not in `tags` (here `h1`) is stripped.

Python and bleach versions (please complete the following information):

  • Python Version: 3.8.9
  • Bleach Version: 5.0.0

To Reproduce

Steps to reproduce the behavior:

from bleach import clean
text = "<div>example<h1> example</h1></div>"
result = clean(text, attributes=[], tags=['div'], strip=True)
print(result)
"""
<div>example
 example</div>
"""

Expected behavior

from bleach import clean
text = "<div>example<h1> example</h1></div>"
result = clean(text, attributes=[], tags=['div'], strip=True)
print(result)
"""
<div>example example</div>
"""

Thank you 🙏

h1 is a block-level tag. As of Bleach 5.0.0, when sanitizing strips a block-level tag, it inserts a \n in its place, because that's what HTML parsers would do in those circumstances. The problem was covered in issue #369.
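For anyone who wants to see the effect without installing bleach, here is a rough stdlib-only sketch of the idea. It is not bleach's code: `StripSketch`, `strip_sketch`, the `BLOCK_LEVEL` set, and the `block_newlines` flag are all made up for illustration, and the rule for where the `\n` goes (only at a stripped block-level start tag that follows text) is an assumption picked to match the output reported above.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small illustrative subset of block-level tag names.
BLOCK_LEVEL = {"div", "h1", "h2", "h3", "p", "ul", "ol", "li", "blockquote", "pre"}


class StripSketch(HTMLParser):
    """Toy tag stripper for illustration only; NOT bleach's implementation."""

    def __init__(self, allowed, block_newlines=True):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.block_newlines = block_newlines  # hypothetical toggle
        self.out = []
        self.last_was_text = False

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Attributes are dropped, mirroring attributes=[] in the report.
            self.out.append(f"<{tag}>")
        elif self.block_newlines and tag in BLOCK_LEVEL and self.last_was_text:
            # Assumed rule: a stripped block-level start tag after text
            # is replaced by a newline.
            self.out.append("\n")
        self.last_was_text = False

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append(f"</{tag}>")
        self.last_was_text = False

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))
        self.last_was_text = True


def strip_sketch(text, allowed, block_newlines=True):
    parser = StripSketch(allowed, block_newlines)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


text = "<div>example<h1> example</h1></div>"
print(repr(strip_sketch(text, ["div"])))                        # '<div>example\n example</div>'
print(repr(strip_sketch(text, ["div"], block_newlines=False)))  # '<div>example example</div>'
```

The first call mirrors the output reported in the issue and the second mirrors the expected output. Bleach's actual condition for emitting the newline may differ from this sketch.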

@willkg Thank you for the explanation. It would be nice to have an option to disable this, since not all use cases need the text made more readable. Would a PR adding such an option be considered?

What's your use case that this is problematic?

In our case, we would like to strip specific tags, including block-level tags, without bleach reformatting the content.

That doesn't really answer my question--it mostly restates the bug. What's the use case here? Why is adding a \n problematic?