buriy/python-readability

Don't use drop_tree() while iterating

P1zz4br0etch3n opened this issue · 2 comments

I found that using drop_tree() while iterating through all nodes in the method remove_unlikely_candidates() breaks the iterator if the dropped element has children.

A simple example:

from lxml.etree import tostring
from readability import Document

HTML = '''
    <html>
        <body>
            <aside id="1">
                    <a>
            </aside>
            <aside id="2"/>
            <aside id="3"/>
        </body>
    </html>
'''

# parse html
DOC = Document(HTML)
html = DOC._html()

# iter through all elements
for elem in html.iter():
    if elem.tag == 'aside':
        elem.drop_tree()

# print resulting html
print(tostring(DOC.html, pretty_print=True, encoding='unicode'))

Output:

<html>
        <body>

            <aside id="2"/>
            <aside id="3"/>
        </body>
    </html>

Every aside element should have been removed, but only the first one was.
If you leave out the a tag, everything works fine.

One solution would be to wrap the iter() call in a list, so the loop runs over a snapshot instead of the live iterator:
for elem in list(self.html.iter()):
Or collect the elements that should be removed in a list, and drop them after the iteration has finished.

Another approach would be to use self.html.findall('.//*'), which returns a list rather than a live iterator.
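
For reference, here is a minimal sketch of the three workarounds, written against plain lxml.html rather than readability itself. The function names and the SOURCE markup are made up for illustration and are not part of python-readability; the asides are closed explicitly so the result does not depend on how the parser treats self-closing tags.

```python
from lxml import html

# Illustrative markup: three asides, the first one with a child element.
SOURCE = """
<html>
  <body>
    <aside id="1"><a>link</a></aside>
    <aside id="2"></aside>
    <aside id="3"></aside>
    <p>kept</p>
  </body>
</html>
"""


def drop_asides_snapshot(root):
    """Loop over a list copy of iter(), so drop_tree() cannot disturb a live iterator."""
    for elem in list(root.iter()):
        if elem.tag == "aside":
            elem.drop_tree()
    return root


def drop_asides_deferred(root):
    """Collect the matching elements first, then drop them after iteration is done."""
    to_drop = [elem for elem in root.iter() if elem.tag == "aside"]
    for elem in to_drop:
        elem.drop_tree()
    return root


def drop_asides_findall(root):
    """Use findall('.//*'), which already returns a list of descendant elements."""
    for elem in root.findall(".//*"):
        if elem.tag == "aside":
            elem.drop_tree()
    return root


def remaining_aside_ids(root):
    """Return the id attributes of every aside still attached to the tree."""
    return [elem.get("id") for elem in root.iter("aside")]


if __name__ == "__main__":
    for fix in (drop_asides_snapshot, drop_asides_deferred, drop_asides_findall):
        root = html.document_fromstring(SOURCE)
        fix(root)
        print(fix.__name__, remaining_aside_ids(root))
```

All three variants finish walking the tree before (or independently of) any modification, so the position of the iterator no longer depends on elements that have just been detached.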

buriy commented

Thanks a lot! Fixed in version 0.7.