elliotgao2/toapi

Access to RawHTML from selectors

ahetawal-p opened this issue · 2 comments

Hello,
I need to get access to the raw HTML in one of my Item instances.
Currently the XPath and CSS selectors always convert the selected node to a string. In my use case, once I have selected a certain part of the webpage, I need to do some post-processing in my clean_ method.
But only a string is passed into it. Is there a way to get the raw HTML passed into my clean_ method for a given key?

Thank you,

Hi @ahetawal-p :

There is a solution here:

# script: example/search.py

class Bing(Item):
    __name__ = 'bing'
    __base_url__ = 'https://www.bing.com'

    url = Css('h2 a', attr='href')
    title = Css('h2 a')

    def clean_url(self, url):
        if isinstance(url, list):
            url = url[0].get('href')
        return url
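
# Alternative: get access to the raw HTML by passing attr='html'
from lxml.html import tostring


class Bing(Item):
    __name__ = 'bing'
    __base_url__ = 'https://www.bing.com'

    url = Css('h2 a', attr='html')
    title = Css('h2 a')

    def clean_url(self, url):
        # url is a list of lxml elements here; serialize the first one.
        # tostring() returns bytes by default; pass encoding='unicode' for a str.
        html = tostring(url[0])
        # do something here ...
        url = ...
        return url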

Thanks a lot, @howie6879. I was able to get the raw HTML based on your solution.
Thank you again!