TeamHG-Memex/scrapy-rotating-proxies

Refresh the list of proxies during scraping

dibodin opened this issue · 11 comments

Hello

I see that the proxy list is loaded in from_crawler (middleware.py): the load happens in the object's constructor.

I read this on a good scraping site: "...write some code that would automatically pick up and refresh the proxy list you use for scraping with working IP addresses. This will save you a lot of time and frustration."

I would like to change the proxy list dynamically, or add to it, while scraping is running. I think it would be a good feature.

Best Regards.

(sorry for my English...)

Hello! If I understand correctly, you want to dynamically load proxy lists from the internet so that you always have the latest proxies.

What you can do is define a custom middleware and a custom proxies class:

import logging

from scrapy import signals
from twisted.internet import task

from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies, ProxyState
from rotating_proxies.utils import extract_proxy_hostport

logger = logging.getLogger(__name__)

class CustomRotatingProxiesMiddleware(RotatingProxyMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        mw = super(CustomRotatingProxiesMiddleware, cls).from_crawler(crawler)
        # Substitute the standard `proxies` object with a custom one.
        # This assumes the proxies come from ROTATING_PROXY_LIST
        # (adjust if you use ROTATING_PROXY_LIST_PATH instead).
        proxy_list = crawler.settings.getlist('ROTATING_PROXY_LIST')
        mw.proxies = CustomProxies(mw.cleanup_proxy_list(proxy_list),
                                   backoff=mw.proxies.backoff)

        # Connect `proxies` to engine signals in order to start and stop the looping task
        crawler.signals.connect(mw.proxies.engine_started, signal=signals.engine_started)
        crawler.signals.connect(mw.proxies.engine_stopped, signal=signals.engine_stopped)
        return mw

class CustomProxies(Proxies):
    
    def engine_started(self):
        """ Create a task for updating proxies every hour """
        self.task = task.LoopingCall(self.update_proxies)
        self.task.start(3600, now=True)

    def engine_stopped(self):
        if self.task.running:
            self.task.stop()

    def update_proxies(self):
        new_proxies = ...  # fetch proxies from wherever you want
        for proxy in new_proxies:
            self.add(proxy)
        
    def add(self, proxy):
        """ Add a proxy to the proxy list """
        if proxy in self.proxies:
            logger.warning("Proxy <%s> is already in proxies list", proxy)
            return

        hostport = extract_proxy_hostport(proxy)
        self.proxies[proxy] = ProxyState()
        self.proxies_by_hostport[hostport] = proxy
        self.unchecked.add(proxy)
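
If it helps, here is one possible way to fill in update_proxies. This is only a sketch: PROXY_SOURCE_URL is a made-up endpoint assumed to return one proxy per line, and the class is assumed to live in the same module as the snippet above (so logger and CustomProxies are in scope). The blocking requests call is pushed to a thread with deferToThread so it does not stall the Twisted reactor that Scrapy runs on:

import requests
from twisted.internet import threads

class FetchingProxies(CustomProxies):
    # Hypothetical endpoint returning one "host:port" proxy per line
    PROXY_SOURCE_URL = 'https://example.com/proxies.txt'

    def update_proxies(self):
        # requests.get blocks, so run it in Twisted's thread pool
        # instead of the reactor thread
        d = threads.deferToThread(requests.get, self.PROXY_SOURCE_URL, timeout=30)
        d.addCallback(self._add_fetched)
        d.addErrback(lambda f: logger.warning("Proxy refresh failed: %s", f.value))
        return d

    def _add_fetched(self, response):
        for line in response.text.splitlines():
            line = line.strip()
            if line:
                self.add(line)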

In settings.py, do you simply replace 'rotating_proxies.middlewares.RotatingProxyMiddleware' with 'YourProject.middlewares.CustomRotatingProxiesMiddleware'? And what about the other settings.py options?

@victor-wyk I think replacing the original middleware with the custom one should do the trick.

Thanks for the quick response. I did some fiddling and found that after replacing it with the custom one, you still have to supply the ROTATING_PROXY_LIST option with a list of proxies, or else the custom middleware will not run. The custom middleware then ignores the list and runs as usual. How can this be solved?

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'MyProject.middlewares.MyProjectDownloaderMiddleware': 543,
    'MyProject.middlewares.CustomRotatingProxiesMiddleware': 610,
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST = ['69.69.69.69:69']

@victor-wyk when you run the spider, do you see the CustomRotatingProxiesMiddleware in the list logged after [scrapy.middleware] INFO: Enabled downloader middlewares:?

@StasDeep I do, but only if I include the ROTATING_PROXY_LIST as shown above. If I get rid of the option, it does not appear.

@victor-wyk but what goes wrong then? As in, what's expected and what's actual?
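
For anyone who hits the same wall: if ROTATING_PROXY_LIST is empty, the parent from_crawler presumably refuses to set the middleware up (raising NotConfigured), which would explain the middleware vanishing from the enabled list. A pragmatic workaround is to keep a single placeholder entry in settings.py and retire it once real proxies arrive. A minimal sketch building on the FetchingProxies example above; the placeholder value is made up, and mark_dead comes from the library's expire.Proxies class:

PLACEHOLDER = 'http://10.255.255.1:3128'  # made-up proxy that will never work

# settings.py keeps the middleware enabled:
# ROTATING_PROXY_LIST = [PLACEHOLDER]

class SeededProxies(FetchingProxies):

    def _add_fetched(self, response):
        super()._add_fetched(response)
        # Once at least one real proxy has been added, retire the
        # placeholder so it is never handed out to requests
        if PLACEHOLDER in self.proxies and len(self.proxies) > 1:
            self.mark_dead(PLACEHOLDER)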

I am facing the same problem with dynamically changing the proxy list while scraping.
Can you tell me what the proxy list is in cleanup_proxy_list(proxy_list)?
@StasDeep
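
For what it's worth, the proxy_list argument there is just a plain list of proxy URL strings, the same shape as ROTATING_PROXY_LIST. As far as I can tell, cleanup_proxy_list only normalizes that list: it strips blank lines and '#' comments and adds an http:// scheme where one is missing. Roughly:

proxy_list = [
    'http://10.0.0.1:8031',
    '10.0.0.2:8032',                        # scheme gets added
    '# commented-out entries are dropped',
]
cleaned = mw.cleanup_proxy_list(proxy_list)
# -> ['http://10.0.0.1:8031', 'http://10.0.0.2:8032'] (order not guaranteed,
#    since duplicates are removed via a set)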

I get a NameError: name 'proxy_list' is not defined when implementing that custom middleware. Also, new_proxy_list, logger, extract_proxy_hostport, and ProxyState are all undefined... @StasDeep
