TeamHG-Memex/scrapy-rotating-proxies

Proxies Stuck in unchecked state

john-parton opened this issue · 11 comments

After running the crawler for over a day, I still have a lot of proxies in the "unchecked" state.

[rotating_proxies.middlewares] INFO: Proxies(good: 147, dead: 3226, unchecked: 524, reanimated: 167, mean backoff time: 4254s)

It looks like those 524 unchecked proxies are just timing out, but they're not getting moved to dead, so a lot of time is wasted sending requests to them.

I set my timeout pretty low with DOWNLOAD_TIMEOUT = 15.

Let me know if you need anything from me: parts of my crawler, settings, etc.

Thanks.

Edit: I have the BanDetectionMiddleware installed.

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

I have this problem too, lots of unchecked proxies, but I have no dead ones.

[rotating_proxies.middlewares] INFO: Proxies(good: 97, dead: 0, unchecked: 97, reanimated: 6, mean backoff time: 0s)

Edit:

I think the 'problem' is that each proxy is picked at random from the combined set of good and unchecked ones.

The main issue here is that I have DOWNLOAD_DELAY set to 1000 seconds. According to the docs, the delay is now applied per proxy. So am I right that, in theory, if I have 100 proxies, each of them should start with one request and then have its own 1000-second delay?

If so, picking a new proxy at random from the good and unchecked ones would slow down the spider. In theory, get_random() could keep returning only 10 of the 100 proxies, so you'd wait 1000 seconds on each of those 10 while the other 90 sit unused.
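To put a rough number on that, here is a minimal sketch. It assumes every call picks uniformly at random, with replacement, from the whole good + unchecked pool, which is only my reading of the behaviour. `expected_distinct` is a name I made up for illustration; it is not part of the library.

```python
def expected_distinct(n_proxies: int, n_draws: int) -> float:
    """Expected number of different proxies seen after n_draws uniform
    picks with replacement from a pool of n_proxies."""
    # Each proxy is missed by a single draw with probability (1 - 1/n),
    # so it is missed by all draws with probability (1 - 1/n) ** n_draws.
    return n_proxies * (1 - (1 - 1 / n_proxies) ** n_draws)


# 100 proxies, 100 requests: how many different proxies get used on average?
print(round(expected_distinct(100, 100), 1))  # 63.4
```

So under that assumption, about a third of the pool would stay untouched after the first 100 requests, while some proxies get picked more than once and have to wait out their delay.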

Thoughts on this?

sorry to shamelessly bump, but bump?

Have the same problem, bump

Got the same problem. I have another question too: does scrapy wait until the unchecked count drops to 0 before it starts using good proxies?
I ask because the max retry count is 5. It looks like a good proxy has not been used 6 times.

shameless bump

I have the same problem

Same issue. For me, it appears to be de-duplicating proxies based on host and port. So if several entries share the same host and port, they remain unchecked and only one of them is used.
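A small sketch of the effect I mean, using only the standard library. I have not confirmed that this is how the middleware stores its proxies; `hostport` and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def hostport(proxy_url: str) -> str:
    """Reduce a proxy URL to 'host:port', dropping scheme and credentials."""
    parts = urlsplit(proxy_url)
    return f"{parts.hostname}:{parts.port}"


# Hypothetical list: the first two entries differ only in their credentials.
proxies = [
    "http://user1:pass1@proxy.example.com:8000",
    "http://user2:pass2@proxy.example.com:8000",
    "http://user3:pass3@proxy.example.com:8001",
]

# Keying by host:port keeps only the last URL seen for each key.
by_hostport = {hostport(p): p for p in proxies}
print(len(proxies), len(by_hostport))  # 3 2
```

If any lookup goes through a host:port key like this, entries that differ only by username and password can never be told apart, which would match what I'm seeing.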

How about something like this?

import random
from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies

class MyRotatingProxiesMiddleware(RotatingProxyMiddleware):
    def __init__(self, proxy_list, *args, **kwargs):
        # forward everything else untouched, so this override does not
        # depend on the parent's exact constructor signature
        super().__init__(proxy_list, *args, **kwargs)
        # swap in the custom pool, reusing the backoff function the parent set up
        self.proxies = MyProxies(self.cleanup_proxy_list(proxy_list), backoff=self.proxies.backoff)

class MyProxies(Proxies):
    def __init__(self, proxy_list, backoff=None):
        super().__init__(proxy_list, backoff)
        # proxies already handed out in the current pass
        self.chosen = set()

    def get_random(self):
        available = list(self.unchecked | self.good)

        if not available:
            return None

        # candidates are the unchecked+good proxies not handed out yet in this pass
        not_picked_yet = [x for x in available if x not in self.chosen]
        if not not_picked_yet:
            # every available proxy has been used once: start a new pass
            self.chosen = set()
            not_picked_yet = available

        # randomly pick one of the proxies not used yet in this pass
        chosen_proxy = random.choice(not_picked_yet)
        # mark as chosen
        self.chosen.add(chosen_proxy)
        return chosen_proxy

Then use MyRotatingProxiesMiddleware.
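Roughly like this in settings.py. The module path `myproject.middlewares` is only a placeholder for wherever you put the class, and the priorities are the ones from the settings posted at the top of this issue.

```python
# settings.py (sketch; 'myproject.middlewares' is a hypothetical module path)
DOWNLOADER_MIDDLEWARES = {
    # replaces 'rotating_proxies.middlewares.RotatingProxyMiddleware'
    'myproject.middlewares.MyRotatingProxiesMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```

Make sure the stock RotatingProxyMiddleware entry is removed, otherwise both middlewares would try to assign a proxy to every request.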

Did this work for you?

bump

iirc, yes

bump

please check my solution above - could still be working