TeamHG-Memex/scrapy-rotating-proxies

Track dead/alive proxies with authentication

petermoore14 opened this issue · 4 comments

The built-in HttpProxyMiddleware correctly sets up the authentication parameters for proxies supplied by the rotating proxies plugin. However, in doing so it changes request.meta['proxy'] to contain only the raw proxy URL, with the credentials stripped out. When the response is received, rotating proxies tries to mark the proxy as good or dead, but silently fails because of the 'proxy not in self.proxies' check, so every proxy stays unmarked forever. This can be verified by using any proxy with authentication and observing that logstats keeps reporting every proxy as unchecked while Scrapy is crawling.
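To illustrate the mismatch, here is a minimal sketch using only the standard library. `strip_credentials` is a hypothetical helper that imitates the effect described above; it is not Scrapy's actual code, and the proxy URL and the `proxies` dict are made-up examples.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(proxy_url):
    """Hypothetical stand-in for the credential stripping described above."""
    parts = urlsplit(proxy_url)
    # Rebuild the network location without the "user:pass@" prefix.
    if parts.port is None:
        hostport = parts.hostname
    else:
        hostport = f"{parts.hostname}:{parts.port}"
    return urlunsplit((parts.scheme, hostport, parts.path, parts.query, parts.fragment))


# The plugin is assumed to key its state on the full URL, credentials included.
proxies = {"http://user:pass@proxy.example.com:8080": "unchecked"}

# What ends up in request.meta['proxy'] once the credentials are removed.
meta_proxy = strip_credentials("http://user:pass@proxy.example.com:8080")

# The membership check fails, so the proxy is never marked good or dead.
found = meta_proxy in proxies
```

Under these assumptions the stripped URL never matches a key in `proxies`, which is why the status stays at unchecked.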

An easy fix is to update '_handle_result' to look up the entry in self.proxies that corresponds to request.meta['proxy'], and use that full proxy URL (credentials included) in the rest of the call. I'll make a PR for this unless you have any objections to this approach.

kmike commented

Good catch. I haven't checked at all how this package works with proxies+auth. Your proposed fix sounds fine to me. Probably a minor point, but I'd like to avoid an O(N) scan on each request, i.e. it may be better to build the short-to-long mapping at startup.

OK, cool, that makes sense. I'll update my PR to use a dict instead to speed things up.

Updated with a hostport->proxies map to optimize retrieval.
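For readers following along, a rough sketch of what a hostport-to-proxy map could look like, built once at startup so each lookup is O(1). The names `hostport`, `build_hostport_map` and `resolve_proxy` are hypothetical and chosen for illustration; this is not necessarily the code in the PR.

```python
from urllib.parse import urlsplit


def hostport(url):
    """Return 'host:port' (or just 'host') for a URL, ignoring any credentials."""
    parts = urlsplit(url)
    if parts.port is None:
        return parts.hostname
    return f"{parts.hostname}:{parts.port}"


def build_hostport_map(proxy_urls):
    """Build the short-to-long mapping once, at startup.

    If two proxies share a host:port but differ in credentials,
    the later one wins in this simplified sketch.
    """
    return {hostport(url): url for url in proxy_urls}


def resolve_proxy(meta_proxy, hostport_map):
    """Map a credential-less request.meta['proxy'] value back to the full URL."""
    return hostport_map.get(hostport(meta_proxy))


hostport_map = build_hostport_map([
    "http://user:pass@proxy.example.com:8080",
    "http://other.example.com:3128",
])
full_url = resolve_proxy("http://proxy.example.com:8080", hostport_map)
```

Here `full_url` is the original URL with credentials, which can then be used as the key when marking the proxy good or dead. A lookup for an unknown proxy returns `None`.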

kmike commented

Fixed by #8 - thanks @petermoore14!