buriy/python-readability

FutureWarning: Use specific 'len(elem)' or 'elem is not None' test instead.

web64 opened this issue · 4 comments

web64 commented

Hi,

I'm getting this warning:

readability/htmls.py:117: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

I'm running Python 3.5.2

Cheers!

Same here..
Any news on that?
What is the thing we have to correct?

It appears to be this statement:

doc.body or doc
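To illustrate what the warning asks for, here is a minimal sketch using the standard library's xml.etree.ElementTree, whose element API lxml mirrors. The helper name body_or_doc is hypothetical and is not part of readability. The point is that an element's truth value depends on whether it has children, so a body containing only text would be skipped by `doc.body or doc`; comparing against None avoids both the warning and that surprise.

```python
import xml.etree.ElementTree as ET


def body_or_doc(doc):
    """Return the <body> element if present, otherwise the root itself.

    Hypothetical helper: the thread's code uses lxml's doc.body instead
    of find("body"), but the None check is the same idea.
    """
    body = doc.find("body")
    # Explicit None check: never rely on the element's truth value,
    # which reflects its number of children, not its existence.
    return body if body is not None else doc


doc = ET.fromstring("<html><body>plain text, no child tags</body></html>")
chosen = body_or_doc(doc)
print(chosen.tag)   # body
print(len(chosen))  # 0
```

With lxml, the equivalent would presumably be to store doc.body in a variable and apply the same `is not None` test.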

I was actually getting bad results, not just warnings (a string containing the repr of a byte buffer). Simple sample code did not show this; it only happened with a real web page. It is unclear whether this is related (it might warrant a new issue).
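One possible cause of a "repr of a byte buffer" result (an assumption, not confirmed in this thread) is calling str() on a bytes object in Python 3, which produces the b'...' representation instead of decoded text. A small standard-library sketch of the difference:

```python
# Bytes such as a serializer might return when asked for UTF-8 output
raw = "<p>caf\u00e9</p>".encode("utf-8")

# str() on bytes does not decode; it returns the repr, prefix included
as_repr = str(raw)

# decode() turns the bytes back into real text
as_text = raw.decode("utf-8")

print(as_repr.startswith("b'"))  # True
print(as_text)
```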

I ended up monkey-patching in a hack. I still got the warning, but at least it worked:

from lxml.etree import tostring
import readability
from readability import Document  # https://github.com/buriy/python-readability/   pip install readability-lxml

## monkey patch

def get_body(doc):
    for elem in doc.xpath(".//script | .//link | .//style"):
        elem.drop_tree()
    # With encoding='utf-8', tostring() returns bytes, so decode to get a str.
    # Note: `doc.body or doc` is still what triggers the FutureWarning.
    # Earlier attempts, kept for reference:
    #raw_html = str_(tostring(doc.body or doc))
    #raw_html = tostring(doc.body or doc)
    raw_html = tostring(doc.body or doc, encoding='utf-8').decode('utf-8')
    cleaned = readability.cleaners.clean_attributes(raw_html)
    try:
        # BeautifulSoup(cleaned) #FIXME do we really need to try loading it?
        return cleaned
    except Exception:  # FIXME find the equivalent lxml error
        # logging.error("cleansing broke html content: %s\n---------\n%s" % (raw_html, cleaned))
        return raw_html


def content(self):
    """Returns document body"""
    # Uses the patched get_body defined above
    return get_body(self._html(True))

Document.content = content
## monkey patch
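Since the patch above still emits the warning, one stopgap is to silence FutureWarning around the call with the standard warnings module. This is only a sketch: noisy is a stand-in for the library call and call_quietly is a hypothetical helper name; neither comes from readability.

```python
import warnings


def noisy():
    # Stand-in for the library call that emits the FutureWarning
    warnings.warn(
        "The behavior of this method will change in future versions.",
        FutureWarning,
    )
    return "ok"


def call_quietly(func, *args, **kwargs):
    """Run func with FutureWarning suppressed, restoring filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        return func(*args, **kwargs)


result = call_quietly(noisy)
print(result)  # ok
```

This hides the symptom only; the underlying truth test on the element is unchanged.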


I was using one line to validate the response of a tag