buriy/python-readability

get_clean_html: lxml error

dufferzafar opened this issue · 4 comments

>>> u = "https://www.geeksforgeeks.org/samsung-research-institute-bangalore-srib-intern/"
>>> import requests
>>> r = requests.get(u)
>>> from readability import Document
>>> doc = Document(r.content)
>>> doc.get_clean_html()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dufferzafar/.local/lib/python3.7/site-packages/readability/readability.py", line 167, in get_clean_html
    return clean_attributes(tounicode(self.html))
  File "src/lxml/etree.pyx", line 3397, in lxml.etree.tounicode
TypeError: Type '<class 'NoneType'>' cannot be serialized.

The doc.summary() method works, but it doesn't seem to have all the data that we want.

And I think I've found the error.

The line shown in the traceback (line 167 of readability.py) should actually be:

   return clean_attributes(tounicode(self._html(True)))

So that it forces the self.html attribute to be set.

Gonna roll with this change for now.
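
To illustrate why going through the accessor avoids the error, here is a minimal sketch of the lazy-parsing pattern. It uses only the standard library; `LazyDocument` and its two `get_clean_html_*` methods are hypothetical stand-ins written for this example, not readability's actual code, and the real library serializes with lxml rather than `xml.etree`.

```python
from xml.etree import ElementTree as ET


class LazyDocument:
    """Toy stand-in for a lazily parsed document (not readability's real class)."""

    def __init__(self, markup):
        self.input = markup
        self.html = None  # the parsed tree is only created on demand

    def _html(self, force=False):
        # Parse on first use, or re-parse when force=True.
        if force or self.html is None:
            self.html = ET.fromstring(self.input)
        return self.html

    def get_clean_html_buggy(self):
        # Reads the attribute directly, so it is still None on a fresh object
        # and the serializer is handed None.
        return ET.tostring(self.html, encoding="unicode")

    def get_clean_html_fixed(self):
        # Goes through the accessor, which guarantees the tree exists.
        return ET.tostring(self._html(True), encoding="unicode")


doc = LazyDocument("<html><body><p>hi</p></body></html>")
try:
    doc.get_clean_html_buggy()
except Exception as exc:
    print("buggy version failed:", type(exc).__name__)
print(doc.get_clean_html_fixed())
```

On a freshly created object the buggy variant fails because nothing has populated `self.html` yet, while the fixed variant parses first and then serializes.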

buriy commented

get_clean_html is meant for cleaning up the summary, not the full document.
You can clean the full document yourself; you don't need this lib for that ;)

I get the same error even when I call get_clean_html on a Document built from requests.get(url).text, but only on one specific URL: https://start.lesechos.fr/travailler-a-letranger/actu-internationales/expatriation-les-pays-qui-chouchoutent-le-plus-les-talents-13988.php

page_response = requests.get(page_link)
doc  = Document(page_response.text)
doc.get_clean_html()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-f64b1934a8ea> in <module>
      1 page_response = requests.get(page_link)
      2 doc  = Document(page_response.text)
----> 3 doc.get_clean_html()

C:\ProgramData\Anaconda3\lib\site-packages\readability\readability.py in get_clean_html(self)
    165         to disable or to improve DOM-to-text conversion in .summary() method
    166         """
--> 167         return clean_attributes(tounicode(self.html))
    168 
    169     def summary(self, html_partial=False):

src/lxml/etree.pyx in lxml.etree.tounicode()

TypeError: Type '<class 'NoneType'>' cannot be serialized.