bug: `\n` when stripping nested tags
drjova opened this issue · 5 comments
Describe the bug
A clear and concise description of what the bug is. [e.g. "bleach.clean
does not escape script tag contents"]
Python and bleach versions (please complete the following information):
- Python Version: 3.8.9
- Bleach Version: 5.0.0
To Reproduce
Steps to reproduce the behavior:
from bleach import clean
text = "<div>example<h1> example</h1></div>"
result = clean(text, attributes=[], tags=['div'], strip=True)
print(result)
"""
<div>example
example</div>
"""
Expected behavior
from bleach import clean
text = "<div>example<h1> example</h1></div>"
result = clean(text, attributes=[], tags=['div'], strip=True)
print(result)
"""
<div>example example</div>
"""
Thank you 🙏
`h1` is a block-level tag. Bleach 5.0.0 fixed sanitizing so that when it removes block-level tags, it adds a `\n` because that's what HTML parsers would do in those circumstances. The problem was covered in issue #369.
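To make the behavior described above concrete, here is a minimal standard-library sketch of a stripper that drops disallowed tags and emits a `\n` where a block-level element was opened. It is not bleach's implementation: `StripSketch`, `strip_sketch`, the `newline_for_blocks` flag, and the small `BLOCK_LEVEL` set are all hypothetical names chosen for illustration, and placing the newline at the opening tag is an assumption made only to mirror the output reported in this issue.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: an illustrative subset of block-level tags, not bleach's actual list.
BLOCK_LEVEL = {"div", "h1", "h2", "h3", "p", "ul", "ol", "li", "blockquote", "pre"}


class StripSketch(HTMLParser):
    """Keep allowed tags (without attributes) and drop all others.

    When a dropped opening tag is block-level and newline_for_blocks is
    True, a "\n" is emitted in its place.
    """

    def __init__(self, allowed, newline_for_blocks=True):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.newline_for_blocks = newline_for_blocks
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.out.append(f"<{tag}>")
        elif self.newline_for_blocks and tag in BLOCK_LEVEL:
            # Hypothetical rule: one newline where the block element began.
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so the output stays valid markup.
        self.out.append(escape(data, quote=False))


def strip_sketch(text, allowed, newline_for_blocks=True):
    parser = StripSketch(allowed, newline_for_blocks)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


text = "<div>example<h1> example</h1></div>"
print(repr(strip_sketch(text, ["div"])))
print(repr(strip_sketch(text, ["div"], newline_for_blocks=False)))
```

With the flag on, the sketch yields the newline-separated form reported above; with it off, it yields the single-line form the reporter expected. The flag exists only in this sketch, and bleach 5.0.0 is not claimed to offer an equivalent option.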
@willkg Thank you for the explanation. It would be nice to have an option to disable this since not all use-cases need to make the text more readable. Would it be considered if I made a PR?
What's your use case that this is problematic?
In our case we would like to clean specific tags, including block-level
tags, without formatting the content.
That doesn't really answer my question--it mostly restates the bug. What's the use case here? Why is adding a `\n` problematic?