Issue with self-closing tags
azmeuk opened this issue · 1 comment
azmeuk commented
I ran into an issue with the Rust-lang blog, which uses empty <a> tags as anchors:
<h2><a class="anchor" href="https://blog.rust-lang.org/2019/12/19/Rust-1.40.0.html#whats-in-1.40.0-stable" id="whats-in-1.40.0-stable"></a>What's in 1.40.0 stable</h2>
Because the tag is empty, readability turns it into a self-closing tag, <a/>:
>>> import readability
>>> readability.Document('<h2><a href="#"></a>foobar</h2>').summary()
'<body id="readabilityBody"><h2><a href="#"/>foobar</h2></body>'
Self-closing <a> tags are not handled well by WebKit (for instance), so this leads to incorrect HTML display.
This seems to happen because the call to lxml's tounicode function is missing the 'method' argument:
>>> from lxml.etree import tounicode
>>> from lxml.html import document_fromstring
>>> tounicode(document_fromstring('<h2><a href="#"></a>Foobar</h2>'))
'<html><body><h2><a href="#"/>Foobar</h2></body></html>'
>>> tounicode(document_fromstring('<h2><a href="#"></a>Foobar</h2>'), method='html')
'<html><body><h2><a href="#"></a>Foobar</h2></body></html>'
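For comparison, here is a small sketch that puts the two serializations side by side and adds a third route through lxml.html.tostring, which defaults to method='html'. The variable names are only illustrative, and this is not necessarily how readability should be patched:

```python
from lxml.etree import tounicode
from lxml.html import document_fromstring, tostring

doc = document_fromstring('<h2><a href="#"></a>Foobar</h2>')

# Default XML serialization: the empty <a> element is collapsed to <a/>.
as_xml = tounicode(doc)

# HTML serialization: the empty <a> element keeps its explicit closing tag.
as_html = tounicode(doc, method='html')

# lxml.html.tostring serializes as HTML by default;
# encoding='unicode' makes it return str instead of bytes.
via_tostring = tostring(doc, encoding='unicode')

print(as_xml)
print(as_html)
print(via_tostring)
```

Either of the last two calls should keep the empty anchor as an open/close pair, which is what browsers expect for non-void elements.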
What do you think?
Éloi
buriy commented
Thanks a lot!
Another way to check is:
python3 -m readability.readability -u https://github.com/buriy/python-readability/
Its output used to contain:
...
<a name="user-content-thanks-to"/>
...
Now it contains:
...
<a name="user-content-thanks-to"></a>
...
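A rough way to automate this kind of check is to scan the serialized output for self-closed tags that are not HTML void elements. The helper below is hypothetical (it is not part of readability) and uses a simple regex heuristic rather than a real HTML parser, so treat it as a sketch:

```python
import re

# HTML void elements, which never take a closing tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}

# Matches a self-closed tag such as <a name="x"/> and captures the tag name.
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def find_self_closed_non_void(html):
    """Return the names of self-closed tags that are not void elements.

    Hypothetical helper for illustration; a heuristic, not a full parser.
    """
    names = (match.group(1).lower() for match in SELF_CLOSED.finditer(html))
    return [name for name in names if name not in VOID_ELEMENTS]


before = '<a name="user-content-thanks-to"/>'
after = '<a name="user-content-thanks-to"></a>'

print(find_self_closed_non_void(before))
print(find_self_closed_non_void(after))
```

Run against the two snippets above, the helper should flag the old output and report nothing for the new one.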