Don't use drop_tree() while iterating
P1zz4br0etch3n opened this issue · 2 comments
P1zz4br0etch3n commented
I found that using drop_tree() while iterating through all nodes in the method remove_unlikely_candidates() breaks the iterator if the dropped element has children.
A simple example:
from lxml.etree import tostring
from readability import Document
HTML = '''
<html>
<body>
<aside id="1">
<a>
</aside>
<aside id="2"/>
<aside id="3"/>
</body>
</html>
'''
# parse html
DOC = Document(HTML)
html = DOC._html()
# iter through all elements
for elem in html.iter():
    if elem.tag == 'aside':
        elem.drop_tree()
# print resulting html
print(tostring(DOC.html, pretty_print=True))
Output:
<html>
<body>
<aside id="2"/>
<aside id="3"/>
</body>
</html>
Every aside tag should have been deleted, but only the first one was.
If you leave out the a tag, everything works as expected.
One solution would be to wrap the iter() call in list(), so the loop runs over a snapshot of the nodes:
for elem in list(self.html.iter()):
Alternatively, collect the elements that should be removed in a list and drop them after the iteration has finished.
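For reference, a minimal sketch of both workarounds using plain lxml.html instead of readability's Document. The sample markup and the helper names drop_asides_snapshot and drop_asides_collect are made up for illustration and are not part of readability.

```python
import lxml.html

# Hypothetical sample markup, similar to the example above.
HTML = """
<html>
  <body>
    <aside id="1"><a>link</a></aside>
    <aside id="2"></aside>
    <aside id="3"></aside>
    <p>keep me</p>
  </body>
</html>
"""


def drop_asides_snapshot(root):
    """Workaround 1: iterate over a list() snapshot of the tree."""
    # list() consumes the iterator completely before anything is removed.
    for elem in list(root.iter()):
        if elem.tag == "aside":
            elem.drop_tree()


def drop_asides_collect(root):
    """Workaround 2: collect the matches first, drop them afterwards."""
    to_drop = [elem for elem in root.iter() if elem.tag == "aside"]
    for elem in to_drop:
        elem.drop_tree()


root_a = lxml.html.fromstring(HTML)
drop_asides_snapshot(root_a)

root_b = lxml.html.fromstring(HTML)
drop_asides_collect(root_b)

# Both trees should now contain no aside elements at all.
print(len(root_a.findall(".//aside")), len(root_b.findall(".//aside")))
```

Both variants finish walking the tree before the first drop_tree() call, so removing a node can no longer affect the traversal.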
P1zz4br0etch3n commented
Another approach would be to use self.html.findall('.//*'), which returns a list rather than a live iterator.
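A small sketch of the findall variant, again with plain lxml.html and made-up sample markup rather than readability's own code. One thing to keep in mind: './/*' selects only descendants, so the root element itself is not part of the result, unlike with iter().

```python
import lxml.html

# Hypothetical sample markup for illustration.
root = lxml.html.fromstring(
    "<html><body>"
    "<aside id='1'><a>x</a></aside>"
    "<aside id='2'></aside>"
    "<aside id='3'></aside>"
    "<p>kept</p>"
    "</body></html>"
)

# findall() builds the complete result list up front,
# so dropping nodes inside the loop is safe.
descendants = root.findall(".//*")
for elem in descendants:
    if elem.tag == "aside":
        elem.drop_tree()

remaining = [elem.tag for elem in root.iter()]
print(remaining)
```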
buriy commented
Thanks a lot! Fixed in version 0.7.