get_clean_html: lxml error
dufferzafar opened this issue · 4 comments
dufferzafar commented
>>> u = "https://www.geeksforgeeks.org/samsung-research-institute-bangalore-srib-intern/"
>>> import requests
>>> r = requests.get(u)
>>> from readability import Document
>>> doc = Document(r.content)
>>> doc.get_clean_html()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dufferzafar/.local/lib/python3.7/site-packages/readability/readability.py", line 167, in get_clean_html
    return clean_attributes(tounicode(self.html))
  File "src/lxml/etree.pyx", line 3397, in lxml.etree.tounicode
TypeError: Type '<class 'NoneType'>' cannot be serialized.
dufferzafar commented
The doc.summary() method works, but it doesn't seem to have all the data that we want.
dufferzafar commented
And I think I've found the error.
Line 167 of readability.py (the line shown in the traceback) should actually be:
return clean_attributes(tounicode(self._html(True)))
So that it forces the self.html attribute to be set.
Gonna roll with this change for now.
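To illustrate why the change helps, here is a minimal sketch of the lazy-parse pattern, assuming self.html starts out as None and is only filled in by the _html() accessor. LazyDoc, serialize, get_clean_html_buggy and get_clean_html_fixed are hypothetical stand-ins written with the standard library; this is not readability's actual code.

```python
import xml.etree.ElementTree as ET


def serialize(tree):
    # Hypothetical helper: mimics a serializer that refuses None,
    # like the TypeError in the traceback above.
    if tree is None:
        raise TypeError("Type '<class 'NoneType'>' cannot be serialized.")
    return ET.tostring(tree, encoding="unicode")


class LazyDoc:
    """Simplified stand-in for a lazily parsed document (assumption, not the real class)."""

    def __init__(self, source):
        self.source = source
        self.html = None  # the parsed tree is only built on demand

    def _html(self, force=False):
        # Parse on first use, or re-parse when forced.
        if force or self.html is None:
            self.html = ET.fromstring(self.source)
        return self.html

    def get_clean_html_buggy(self):
        # Reads the attribute directly, so it is still None before any parse.
        return serialize(self.html)

    def get_clean_html_fixed(self):
        # Goes through the accessor, which guarantees the tree exists.
        return serialize(self._html(True))


doc = LazyDoc("<html><body><p>hi</p></body></html>")
try:
    doc.get_clean_html_buggy()
except TypeError as exc:
    print("buggy path failed:", exc)
print(doc.get_clean_html_fixed())
```

In this sketch the buggy method only fails when nothing has triggered a parse yet, which would be consistent with doc.summary() working while doc.get_clean_html() on a fresh Document does not.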
buriy commented
get_clean_html is meant for cleaning up the summary, not the full document.
You can clean the full document yourself; you don't need this lib for that ;)
MChrys commented
I get the same error even when I use get_clean_html only on requests.get(url).text, but only on one specific URL: https://start.lesechos.fr/travailler-a-letranger/actu-internationales/expatriation-les-pays-qui-chouchoutent-le-plus-les-talents-13988.php
page_response = requests.get(page_link)
doc = Document(page_response.text)
doc.get_clean_html()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-f64b1934a8ea> in <module>
1 page_response = requests.get(page_link)
2 doc = Document(page_response.text)
----> 3 doc.get_clean_html()
C:\ProgramData\Anaconda3\lib\site-packages\readability\readability.py in get_clean_html(self)
165 to disable or to improve DOM-to-text conversion in .summary() method
166 """
--> 167 return clean_attributes(tounicode(self.html))
168
169 def summary(self, html_partial=False):
src/lxml/etree.pyx in lxml.etree.tounicode()
TypeError: Type '<class 'NoneType'>' cannot be serialized.