Access to RawHTML from selectors
ahetawal-p opened this issue · 2 comments
ahetawal-p commented
Hello,
I need access to the raw HTML in one of my Item instances.
Currently the XPath or CSS selectors always convert the node to a string. In my use case, once I select a certain part of the webpage, I need to do some post-processing in my clean_ method, but only a string is passed into it. Is there a way to have the raw HTML passed into my clean_ method for a given key?
Thank you,
howie6879 commented
Hi @ahetawal-p :
There is a solution here:
# script: example/search.py
class Bing(Item):
    __name__ = 'bing'
    __base_url__ = 'https://www.bing.com'
    url = Css('h2 a', attr='href')
    title = Css('h2 a')

    def clean_url(self, url):
        if isinstance(url, list):
            url = url[0].get('href')
        return url
# get access to the raw HTML
class Bing(Item):
    __name__ = 'bing'
    __base_url__ = 'https://www.bing.com'
    url = Css('h2 a', attr='html')
    title = Css('h2 a')

    def clean_url(self, url):
        from lxml.html import tostring
        # url is a list of matched elements; serialize the first one.
        # tostring() returns bytes unless encoding='unicode' is passed.
        html = tostring(url[0])
        # do something here ...
        url = ...
        return url
ahetawal-p commented
Thanks a lot @howie6879. I was able to get my raw HTML based on your solution.
Thank you again!