[Accessibility bug] Semantic whitespace implied by block elements isn't retained properly
Separate block elements imply visual spacing when rendered on screen, and that spacing should be retained when the elements are removed so that words stay properly separated. Bleach doesn't seem to handle this right now:
>>> import bleach
>>> html = "<p>Te<b>st</b>!</p><p>Hello</p>"
>>> bleach.clean(html, tags=[], strip=True)
'Test!Hello'
>>>
The expected result would be: 'Test! Hello'
(since <p> is a block element, and therefore two of them in a row imply a visual line break that is vital for the readability of the text)
Edit: just to make this clear, I am not proposing parsing the CSS to find out what is a block element or something over-the-top like that. But a reasonable default behavior that covers proper semantic HTML would be nice (without support for rogue CSS that unreasonably makes <p> inline or nonsense like that). That would work properly for 99% of the web content out there, unlike the current implementation, which seems destined to drop vital whitespace on any non-trivial page.
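To make the request concrete, here is a rough sketch of the behavior I mean. It is not Bleach's API or implementation: it uses only the standard library's `html.parser`, and the names `BLOCK_TAGS`, `BlockAwareStripper` and `strip_tags_keep_blocks` are made up for illustration. The separator is a parameter because which character to insert is a separate question.

```python
from html.parser import HTMLParser

# Hypothetical subset of block-level tags, for illustration only.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "blockquote", "pre"}


class BlockAwareStripper(HTMLParser):
    """Drop all tags, but remember where block elements began or ended."""

    def __init__(self, separator=" "):
        super().__init__(convert_charrefs=True)
        self.separator = separator
        self.parts = []             # text chunks and separators, in order
        self.pending_break = False  # block boundary seen since the last text

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.pending_break = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.pending_break = True

    def handle_data(self, data):
        # Ignore whitespace-only text sitting between two block elements.
        if self.pending_break and not data.strip():
            return
        # One separator per run of block boundaries, and never a leading one.
        if self.pending_break and self.parts:
            self.parts.append(self.separator)
        self.pending_break = False
        self.parts.append(data)


def strip_tags_keep_blocks(html, separator=" "):
    parser = BlockAwareStripper(separator)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags_keep_blocks("<p>Te<b>st</b>!</p><p>Hello</p>"))  # Test! Hello
```

Inline tags such as `<b>` never set the flag, so "Te" and "st" stay joined, while the `</p><p>` boundary yields exactly one separator.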
Can you make a list of which block elements you want to handle?
https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements#Elements
Using this or a similar list would be a good idea, I think.
Transcribing that list here:
- address
- article
- aside
- blockquote
- canvas
- dd
- div
- dl
- dt
- fieldset
- figcaption
- figure
- footer
- form
- h1, h2, h3, h4, h5, h6
- header
- hgroup
- hr
- li
- main
- nav
- noscript
- ol
- output
- p
- pre
- section
- table
- tfoot
- ul
- video
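In code, that list could live in a single constant. This is only a sketch: `BLOCK_LEVEL_TAGS` and `is_block_level` are hypothetical names, not something Bleach defines.

```python
# Tag names transcribed from the list above.
BLOCK_LEVEL_TAGS = frozenset({
    "address", "article", "aside", "blockquote", "canvas", "dd", "div",
    "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li",
    "main", "nav", "noscript", "ol", "output", "p", "pre", "section",
    "table", "tfoot", "ul", "video",
})


def is_block_level(tag):
    """Return True if the tag name is in the list, ignoring case."""
    return tag.lower() in BLOCK_LEVEL_TAGS


print(is_block_level("P"), is_block_level("b"))  # True False
```

A frozenset keeps the lookup cheap and makes it obvious that the list is not meant to be mutated at runtime.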
Can you take a stab at implementing this? I don't think I'm going to get to this for a while.
I recently wrote a rich text layouter that imports HTML, and I got this as a by-product, so right now I have no immediate need for this. I just thought it'd be a nice thing to add at some point.
I took a stab at this ticket. The solution is not perfect but allows for readable and accessible text.
While discussion of my approach should continue on the PR, I wanted to raise one point here:
The whitespace character should be a NEWLINE and not a SPACE.
input: <p>Te<b>st</b>!</p><p>Hello</p>
output:
- 'Test! Hello'
+ 'Test!\nHello'
The main reason is that more complex use cases, such as lists, become unreadable with a space character: all the blocks bleed together.
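A small standalone comparison of the two separators on a list. This is not the code from the PR, just a sketch built on the standard library's `html.parser`; `BLOCK_TAGS`, `TextChunks` and `to_text` are hypothetical names, and the tag set is a tiny subset chosen for the demo.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "ul", "ol", "li"}  # hypothetical subset, enough for this demo


class TextChunks(HTMLParser):
    """Collect one text chunk per block; inline tags are simply dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = [""]  # the last chunk is the one currently being filled

    def _boundary(self, tag):
        # Open a new chunk at a block boundary, unless the current one is empty.
        if tag in BLOCK_TAGS and self.chunks[-1]:
            self.chunks.append("")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.chunks[-1] += data


def to_text(html, separator):
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return separator.join(chunk for chunk in parser.chunks if chunk)


html = "<p>Shopping list:</p><ul><li>two eggs</li><li>rye bread</li><li>oat milk</li></ul>"
print(repr(to_text(html, " ")))   # 'Shopping list: two eggs rye bread oat milk'
print(repr(to_text(html, "\n")))  # 'Shopping list:\ntwo eggs\nrye bread\noat milk'
```

With a space there is no way to tell where one item ends and the next begins; with a newline each item keeps its own line.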
> The whitespace character should be a NEWLINE and not a SPACE.
Yeah, you are correct, I got that wrong. Between block elements there should be a newline, since that is also how it is rendered in a browser 👍