TeamHG-Memex/scrapy-rotating-proxies

Scrapy stuck when page not response

herbert-h opened this issue · 5 comments

Scrapy gets stuck when a page doesn't respond. Can I set a timeout for the page?

...
2018-01-22 09:27:09 [scrapy.extensions.logstats] INFO: Crawled 183 pages (at 42 pages/min), scraped 183 items (at 42 items/min)
2018-01-22 09:27:09 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 2, unchecked: 0, reanimated: 3, mean backoff time: 76s)
2018-01-22 09:27:39 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 4, unchecked: 0, reanimated: 0, mean backoff time: 159s)
2018-01-22 09:28:09 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 13 pages/min), scraped 196 items (at 13 items/min)
2018-01-22 09:28:09 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 3, unchecked: 0, reanimated: 1, mean backoff time: 199s)
2018-01-22 09:28:39 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 3, unchecked: 0, reanimated: 1, mean backoff time: 199s)
2018-01-22 09:29:09 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 0 pages/min), scraped 196 items (at 0 items/min)
2018-01-22 09:29:09 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 2, unchecked: 0, reanimated: 2, mean backoff time: 242s)

It waits more than 5 minutes before attempting the first retry.

We need more info to help you; it could be a network problem on your end or on the server you're scraping.

I ran into the same problem. When using a proxy, the default download timeout (of 180 seconds) is used.
You can adjust this with the download_timeout request meta key or via the DOWNLOAD_TIMEOUT setting.
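For reference, both knobs are standard Scrapy features; a minimal sketch (the 30-second value is just an example, not a recommendation):

```python
# settings.py -- lower the project-wide download timeout.
# Scrapy's DOWNLOAD_TIMEOUT defaults to 180 seconds.
DOWNLOAD_TIMEOUT = 30

# Alternatively, override it per request from a spider callback:
#   yield scrapy.Request(url, meta={'download_timeout': 30})
```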

To explain this: you are running out of proxies. The middleware has a default backoff delay of 180 seconds, which means it will use proxy A only once every 3 minutes. In your case all of your proxies are still waiting to cool down, so the crawler has no free proxies/slots and is waiting.
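The cool-down behavior described above can be pictured with a generic exponential-backoff-with-full-jitter sketch. This mirrors the idea only, not the middleware's exact code; the function name and the 180-second base are taken from this thread, and the 3600-second cap is an assumed illustrative value:

```python
import random

def exp_backoff_full_jitter(attempt, base=180, cap=3600):
    """Full-jitter exponential backoff: each failed attempt doubles the
    maximum wait, then a uniform random delay in [0, max] is drawn.
    The base/cap values here are illustrative."""
    max_delay = min(cap, base * 2 ** attempt)
    return random.uniform(0, max_delay)

# A dead proxy's successive cool-downs grow quickly, so with few proxies
# the crawler can easily end up with every proxy still cooling down:
for attempt in range(4):
    delay = exp_backoff_full_jitter(attempt)
    # delay falls somewhere in [0, min(3600, 180 * 2**attempt)] seconds
```

With only a handful of proxies, a burst of failures pushes all of them into this waiting state at once, which matches the "good: 0/1" lines in the log above.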

I have this problem too. But in my case I saw that there were still unchecked proxies.

I believe the fix suggested in issue #33 resolved it for me. In line 123 of middlewares.py, I replaced if 'proxy' in request.meta and not request.meta.get('_rotating_proxy'): with if 'proxy' in request.meta:. It seems to have worked for me, but I can't say so with absolute certainty.
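To make the effect of that one-line change concrete, here is a standalone sketch of the two conditions, detached from the middleware (the function names are hypothetical and request_meta stands in for request.meta):

```python
def skips_request_original(request_meta):
    # Original check in middlewares.py: only skip requests carrying a
    # proxy that the middleware itself did NOT assign.
    return 'proxy' in request_meta and not request_meta.get('_rotating_proxy')

def skips_request_patched(request_meta):
    # Patched check from issue #33: skip ANY request that already has a
    # proxy set, including ones the middleware assigned earlier.
    return 'proxy' in request_meta

# A retried request still carrying the middleware's own proxy:
meta = {'proxy': 'http://1.2.3.4:8080', '_rotating_proxy': True}
# original: returns False, so the middleware tries to reassign a proxy
#           (and may stall if every proxy is cooling down);
# patched:  returns True, so the request passes through with the proxy
#           it already has.
```

This suggests why the patch can unstick a crawl: retried requests keep their existing proxy instead of competing for a free slot.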