buriy/python-readability

Issue with self-closing tags

azmeuk opened this issue · 1 comments

I encountered an issue with the Rust-lang blog. There are empty <a> tags that are used as anchors:

<h2><a class="anchor" href="https://blog.rust-lang.org/2019/12/19/Rust-1.40.0.html#whats-in-1.40.0-stable" id="whats-in-1.40.0-stable"></a>What's in 1.40.0 stable</h2>

Because the tag is empty, readability turns it into a self-closing tag <a/>:

>>> import readability
>>> readability.Document('<h2><a href="#"></a>foobar</h2>').summary()
'<body id="readabilityBody"><h2><a href="#"/>foobar</h2></body>'

Self-closing <a> tags are not handled well by WebKit (for instance), so the resulting HTML renders incorrectly:

Screenshot from 2019-12-29 14-53-51

This seems to happen because the call to lxml's tounicode function is missing the 'method' argument:

>>> from lxml.etree import tounicode
>>> from lxml.html import document_fromstring
>>> tounicode(document_fromstring('<h2><a href="#"></a>Foobar</h2>'))
'<html><body><h2><a href="#"/>Foobar</h2></body></html>'
>>> tounicode(document_fromstring('<h2><a href="#"></a>Foobar</h2>'), method='html')
'<html><body><h2><a href="#"></a>Foobar</h2></body></html>'
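For anyone without lxml at hand, the same XML-versus-HTML serialization difference can be reproduced with the standard library's xml.etree.ElementTree. This is only an analogous sketch: it does not use lxml or readability, and the variable names are mine. The default serializer emits XML and collapses the empty element, while method="html" keeps the explicit closing tag.

```python
import xml.etree.ElementTree as ET

# Parse a heading that contains an empty anchor, as in the Rust blog markup.
heading = ET.fromstring('<h2><a href="#"></a>foobar</h2>')

# Default serialization is XML: the empty <a> is written in
# self-closing form, with no separate </a> tag.
as_xml = ET.tostring(heading, encoding="unicode")

# HTML serialization always writes a closing tag for non-void elements.
as_html = ET.tostring(heading, encoding="unicode", method="html")

print(as_xml)
print(as_html)  # <h2><a href="#"></a>foobar</h2>
```

The point is only that the serialization method, not the parsed tree, decides whether the empty anchor is collapsed.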

What do you think?

Éloi

buriy commented

Thanks a lot!
Another way to check is to run:
python3 -m readability.readability -u https://github.com/buriy/python-readability/
Its output used to contain:

...
<a name="user-content-thanks-to"/>
...

Now it contains:

...
<a name="user-content-thanks-to"></a>
...
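The manual check above could be automated with a small helper that scans serialized output for self-closed tags that are not HTML void elements. This is a rough sketch using only the standard library. The function name find_self_closed_non_void is hypothetical and not part of readability, and the regular expression is a heuristic: it would be confused by ">" inside attribute values or by unquoted attribute values ending in "/".

```python
import re

# HTML void elements: these never take a closing tag, so a
# self-closing form such as <br/> is harmless for them.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
})

# Matches an opening tag written in XML self-closing form, e.g. <a href="#"/>.
_SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def find_self_closed_non_void(html):
    """Return the names of self-closed tags that browsers will not treat as closed."""
    return [
        match.group(1)
        for match in _SELF_CLOSED.finditer(html)
        if match.group(1).lower() not in VOID_ELEMENTS
    ]


# The output reported in this issue is flagged.
print(find_self_closed_non_void('<h2><a href="#"/>foobar</h2>'))
# The fixed output is not.
print(find_self_closed_non_void('<a name="user-content-thanks-to"></a>'))
```

A regression test could then assert that the helper returns an empty list for the result of summary().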