TeamHG-Memex/scrapy-rotating-proxies

All proxies unchecked except one

milicamilivojevic opened this issue · 4 comments

    Proxies(good: 1, dead: 0, unchecked: 19999, reanimated: 0, mean backoff time: 0s)
    Proxies(good: 0, dead: 1, unchecked: 19999, reanimated: 0, mean backoff time: 572s)
I have a list of 20,000 proxies in this format:

    username1:password1@host:port
    username2:password2@host:port
    username3:password3@host:port
    username4:password4@host:port
    username5:password5@host:port
    username6:password6@host:port
    username7:password7@host:port
All IPs and ports are the same, but the usernames and passwords are different.
With this format only one proxy is ever used and the others stay unchecked, even though I have 20,000 proxies.
Can you please help me?
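
For reference, a list like this is normally wired up through the extension's settings. A minimal sketch, assuming the proxies are stored one per line in a proxies.txt file and using the middleware priorities from the README:

    # settings.py -- minimal sketch
    ROTATING_PROXY_LIST_PATH = 'proxies.txt'   # one username:password@host:port per line

    DOWNLOADER_MIDDLEWARES = {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    }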

Getting the same issue; my guess is it has something to do with how proxy authorization is handled.

@darshanlol is right: it involves proxy authorization, but not directly.

The problem is in the get_proxy method in expire.py: when proxies carry credentials, this method looks a proxy up by its host:port alone. Scrapy's HttpProxyMiddleware strips the authorization part from request.meta['proxy'], and since all 20,000 entries share the same host and port, every good/dead/unchecked marking lands on a single entry.
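
To make the collapse concrete, here is a small illustrative sketch (not library code; the hostname is made up) of what stripping credentials does to a list like the one above:

    # Illustrative only: stripping credentials collapses distinct
    # authenticated proxies into a single host:port key.
    from urllib.parse import urlsplit, urlunsplit

    def strip_credentials(proxy_url):
        parts = urlsplit(proxy_url)
        netloc = parts.hostname or ''
        if parts.port:
            netloc += ':%d' % parts.port
        return urlunsplit((parts.scheme, netloc, parts.path, '', ''))

    proxies = [
        "http://username1:password1@proxy.example.com:8000",
        "http://username2:password2@proxy.example.com:8000",
    ]
    print({strip_credentials(p) for p in proxies})
    # {'http://proxy.example.com:8000'} -- both entries map to the same key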

I just added another key to request.meta with the original proxy URL and it worked. It's ultimately a workaround for the aforementioned issue, but it works. These are the changes I made in the middlewares.py file, in the RotatingProxyMiddleware class.

    def process_request(self, request, spider):
        if 'proxy' in request.meta and not request.meta.get('_rotating_proxy'):
            return
        proxy = self.proxies.get_random()
        if not proxy:
            if self.stop_if_no_proxies:
                raise CloseSpider("no_proxies")
            else:
                logger.warn("No proxies available; marking all proxies "
                            "as unchecked")
                self.proxies.reset()
                proxy = self.proxies.get_random()
                if proxy is None:
                    logger.error("No proxies available even after a reset.")
                    raise CloseSpider("no_proxies_after_reset")

        request.meta['proxy'] = proxy
        request.meta['download_slot'] = self.get_proxy_slot(proxy)
        request.meta['_rotating_proxy'] = True
        request.meta['_original_proxy_url'] = proxy    # keep the full proxy URL (credentials included) for later marking

...

    def _handle_result(self, request, spider):
        proxy = request.meta.get("_original_proxy_url", None)      # changing proxy variable to grab from request.meta
        if not (proxy and request.meta.get("_rotating_proxy")):
            return
        self.stats.set_value(
            "proxies/unchecked",
            len(self.proxies.unchecked) - len(self.proxies.reanimated),
        )
        self.stats.set_value("proxies/reanimated", len(self.proxies.reanimated))
        self.stats.set_value("proxies/mean_backoff", self.proxies.mean_backoff_time)
        ban = request.meta.get("_ban", None)
        if ban is True:
            self.proxies.mark_dead(proxy)
            self.stats.set_value("proxies/dead", len(self.proxies.dead))
            return self._retry(request, spider)
        elif ban is False:
            self.proxies.mark_good(proxy)
            self.stats.set_value("proxies/good", len(self.proxies.good))
3hhh commented

I can confirm this one.

It seems that requests do initially go out with the different usernames & passwords, even though the reporting doesn't work.

However, retries tend to use the same bad proxy again and again, essentially making them useless, so this is not just a reporting problem. I'm not sure whether the above workaround also helps with that.

Maybe it's because [1] removes the user & password from request.meta['proxy'], which is then returned with the response and hits [2], including the _retry().

[1] https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/httpproxy.html#HttpProxyMiddleware
[2] https://github.com/TeamHG-Memex/scrapy-rotating-proxies/blob/master/rotating_proxies/middlewares.py#L161
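
One way to check that hypothesis is to log what request.meta['proxy'] contains by the time the response comes back. A debugging sketch; the class name is a placeholder, and you would register it yourself in DOWNLOADER_MIDDLEWARES:

    import logging

    logger = logging.getLogger(__name__)

    class ProxyMetaLoggerMiddleware:
        # Debugging sketch: if [1] strips the credentials, this should
        # log 'http://host:port' with the userinfo already gone.
        def process_response(self, request, response, spider):
            logger.info("proxy on response: %s", request.meta.get('proxy'))
            return response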

3hhh commented

I guess this one is not too uncommon, as rotating proxy providers tend to implement their APIs via the username field.