TeamHG-Memex/scrapy-rotating-proxies

Update proxy list once a day

mo-tech55 opened this issue · 1 comment

I use a proxy list from a proxy provider, and my list is renewed once a day. I fetch the list from the provider via their API.

settings.py:

import requests

# API_KEY is defined elsewhere in my settings
def proxy_list() -> list:
    # Fetch the current proxy list from the provider's API
    response = requests.get(
        'https://proxy.webshare.io/api/proxy/list',
        headers={'Authorization': f'Token {API_KEY}'},
    ).json()
    return [
        f'{proxy_elem["proxy_address"]}:{proxy_elem["ports"]["http"]}'
        for proxy_elem in response['results']
    ]

ROTATING_PROXY_LIST = proxy_list()
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

The above code works, but I think Scrapy executes my proxy_list function just once, when I start my spider.
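That is indeed what happens: settings.py is a plain Python module, so a call made at module level runs exactly once, when Scrapy imports the settings, and the resulting list stays fixed for the lifetime of the process. A minimal sketch of that evaluate-once behaviour (with a counting stub standing in for the webshare API request; the addresses are made up):

```python
call_count = 0

def proxy_list():
    # Stub standing in for the webshare API request
    global call_count
    call_count += 1
    return [f'10.0.0.{call_count}:8080']

# Module-level assignment, as in settings.py: evaluated once at import
ROTATING_PROXY_LIST = proxy_list()

# Re-reading the variable later does NOT re-run the function
stale = ROTATING_PROXY_LIST
fresh = proxy_list()  # only an explicit new call fetches a new list
```

So to refresh the proxies, something has to call proxy_list() again and feed the result back to the middleware.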

I keep my spider running in the spider.py file with the following code:

from twisted.internet import reactor
from twisted.internet.task import deferLater
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def sleeping(self, *args, seconds):
    '''Non-blocking sleep callback'''
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider, sleeptime=60):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print(f'waiting {sleeptime} seconds before restart...'))
    deferred.addCallback(sleeping, seconds=sleeptime)
    deferred.addCallback(_crawl, spider)
    return deferred

if __name__ == "__main__":
    # AmazonSpider is imported from the project's spiders
    _crawl(None, AmazonSpider, sleeptime=1800)
    process.start()

So the question is: how do I tell Scrapy to update my proxy list every few hours?
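One possible workaround, since the restart loop already re-creates a crawler on every cycle: Scrapy reads a spider class's custom_settings each time a Crawler is built, so re-fetching the list and attaching it to the spider class before each process.crawl call should give every cycle fresh proxies. This is only a sketch, not the fix from #40; SpiderStub, fetch_proxy_list, and the addresses below are stand-ins for the real spider and API call:

```python
class SpiderStub:
    """Stand-in for AmazonSpider; custom_settings is read per crawl."""
    custom_settings = None

def fetch_proxy_list():
    # Stub for the webshare API request (hypothetical addresses)
    return ['10.0.0.1:8080', '10.0.0.2:8080']

def refresh_proxies(spider_cls):
    """Call this inside _crawl, right before process.crawl(spider_cls),
    so each restart cycle starts with a freshly fetched list."""
    spider_cls.custom_settings = {
        **(spider_cls.custom_settings or {}),
        'ROTATING_PROXY_LIST': fetch_proxy_list(),
    }

refresh_proxies(SpiderStub)
```

Note the list is still only refreshed between crawl cycles, not while a crawl is in flight.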

See issue #40 for the solution.