mozilla/bleach

[Accessibility bug] Semantic whitespace implied by block elements isn't retained properly

Opened this issue · 6 comments

Separate block elements imply visual spacing when rendered on screen, and that spacing should be retained when the elements are removed so that words stay properly separated. Bleach doesn't seem to handle this right now:

>>> import bleach
>>> html = "<p>Te<b>st</b>!</p><p>Hello</p>"
>>> bleach.clean(html, tags=[], strip=True)
'Test!Hello'
>>>

The expected result would be: 'Test! Hello' (since <p> is a block element, two of them in succession imply a visual line break that is vital for the readability of the text).

Edit: just to make this clear, I am not proposing parsing the CSS to find out what is a block element, or anything over-the-top like that. But a reasonable default behavior that covers proper semantic HTML would be nice (without support for rogue CSS that unreasonably makes <p> inline, or nonsense like that). That would work properly for 99% of the web content out there, unlike the current implementation, which seems destined to drop vital whitespace on any non-trivial page.

Can you make a list of which block elements you want to handle?

https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements#Elements

Using this or a similar list would be a good idea, I think.

Transcribing that list here:

  • address
  • article
  • aside
  • blockquote
  • canvas
  • dd
  • div
  • dl
  • dt
  • fieldset
  • figcaption
  • figure
  • footer
  • form
  • h1, h2, h3, h4, h5, h6
  • header
  • hgroup
  • hr
  • li
  • main
  • nav
  • noscript
  • ol
  • output
  • p
  • pre
  • section
  • table
  • tfoot
  • ul
  • video
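For anyone picking this up, here is a minimal sketch of how the list above could drive the stripping. It uses only the standard library's `html.parser` rather than bleach's internals, so it is not how bleach works or how a fix would necessarily look. `BLOCK_ELEMENTS`, `BlockAwareStripper`, and `strip_tags_keep_blocks` are hypothetical names made up for this illustration, and the sketch does not normalize whitespace that already sits between tags.

```python
from html.parser import HTMLParser

# Block-level element names transcribed from the list above.
BLOCK_ELEMENTS = frozenset({
    "address", "article", "aside", "blockquote", "canvas", "dd", "div",
    "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li",
    "main", "nav", "noscript", "ol", "output", "p", "pre", "section",
    "table", "tfoot", "ul", "video",
})


class BlockAwareStripper(HTMLParser):
    """Hypothetical helper: drops every tag but remembers block boundaries."""

    def __init__(self, separator=" "):
        super().__init__(convert_charrefs=True)
        self.separator = separator
        self.parts = []
        self.pending_break = False

    def _mark_boundary(self, tag):
        # Only request a separator once some text has already been emitted,
        # so the output never starts with one.
        if tag in BLOCK_ELEMENTS and self.parts:
            self.pending_break = True

    def handle_starttag(self, tag, attrs):
        self._mark_boundary(tag)

    def handle_endtag(self, tag):
        self._mark_boundary(tag)

    def handle_data(self, data):
        # Emit one separator per run of block boundaries, and only when
        # more text follows, so the output never ends with one.
        if self.pending_break:
            self.parts.append(self.separator)
            self.pending_break = False
        self.parts.append(data)


def strip_tags_keep_blocks(html, separator=" "):
    """Return the text of `html` with block boundaries kept as `separator`."""
    parser = BlockAwareStripper(separator)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(repr(strip_tags_keep_blocks("<p>Te<b>st</b>!</p><p>Hello</p>")))
# prints 'Test! Hello'
```

Inline elements such as <b> are not in the set, so "Te" and "st" still join into one word, while the boundary between the two <p> elements becomes a single separator.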

Can you take a stab at implementing this? I don't think I'm going to get to this for a while.

I recently wrote a rich text layouter that imports HTML, and I got this as a byproduct, so right now I have no immediate need for it. I just thought it'd be a nice thing to add at some point.

I took a stab at this ticket. The solution is not perfect but allows for readable and accessible text.

While discussion of my approach should continue on the PR, I wanted to raise one point in this thread:

The whitespace character should be a NEWLINE and not a SPACE.

input: <p>Te<b>st</b>!</p><p>Hello</p>
output:
- 'Test! Hello'
+ 'Test!\nHello'

The main reason is that more complex use cases, such as lists, become unreadable with a space character: all the blocks bleed together.
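To illustrate the list case, here is a rough standalone sketch comparing the two separators. It is not the code from the PR; `flatten` and the reduced `BLOCKS` set are hypothetical and exist only for this comparison.

```python
from html.parser import HTMLParser

# Reduced block set, just enough for this comparison (an assumption).
BLOCKS = {"p", "ul", "ol", "li"}


def flatten(html, sep):
    """Strip all tags, emitting `sep` once between adjacent blocks of text."""
    parts = []
    pending = [False]  # one-element list so the nested handlers can mutate it

    class Stripper(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag in BLOCKS and parts:
                pending[0] = True

        def handle_endtag(self, tag):
            if tag in BLOCKS and parts:
                pending[0] = True

        def handle_data(self, data):
            if pending[0]:
                parts.append(sep)
                pending[0] = False
            parts.append(data)

    parser = Stripper()
    parser.feed(html)
    parser.close()
    return "".join(parts)


html = "<ul><li>Buy milk today</li><li>Call the bank</li></ul><p>All done</p>"
print(repr(flatten(html, " ")))   # 'Buy milk today Call the bank All done'
print(repr(flatten(html, "\n")))  # 'Buy milk today\nCall the bank\nAll done'
```

With a space, the two list items and the trailing paragraph read as one run-on sentence; with a newline, each block stays on its own line.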

The whitespace character should be a NEWLINE and not a SPACE.

Yeah, you are correct, I got that wrong. Between block elements there should be a newline, since that is also how it is rendered in a browser 👍