TeamHG-Memex/scrapy-rotating-proxies

use proxy with scrapy-splash

kadimon opened this issue · 6 comments

Hi! How do I use this proxy rotator and scrapy-splash together?

These settings don't work:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # 'scrapy.contrib.downloadermiddleware' is the deprecated path
    'random_useragent.RandomUserAgentMiddleware': 400,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
kmike commented

Currently there is no built-in way to do this.

A first option would be to use request.meta['splash']['args']['proxy'] instead of request.meta['proxy'] in RotatingProxyMiddleware.process_request, and to remove everything related to download_slot.

A second option would be to change the way request.meta['proxy'] is handled in scrapy-splash: instead of using this proxy for requests to Splash, it could pass it on as an argument to Splash.
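The first option can be sketched at the meta-dict level. The helper name below is illustrative (not part of scrapy-rotating-proxies), and the dict shape assumes a request built with scrapy_splash.SplashRequest:

```python
# Sketch of the first option: put the proxy into the Splash arguments
# (which scrapy-splash forwards to the Splash HTTP API) instead of into
# request.meta['proxy'].

def assign_proxy_to_splash(meta, proxy):
    """Store `proxy` so that Splash itself, not Scrapy, connects through it.

    `meta` plays the role of request.meta for a SplashRequest, which
    carries a 'splash' dict with an 'args' dict inside.
    """
    args = meta.setdefault('splash', {}).setdefault('args', {})
    args['proxy'] = proxy
    # deliberately do NOT touch meta['proxy'] or meta['download_slot']
    return meta
```

Splash's render endpoints accept a `proxy` argument, so the value ends up applied to the outgoing request that Splash makes on the spider's behalf.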

fpun commented

Hi Mikhail,
I'm trying to use the rotating proxies middleware with a hosted headless Chrome to render pages. I tweaked middlewares.py so that it replaces request.url with the Chrome instance URL, passing the actual target URL and the proxy generated by the middleware as URL parameters (which are then handled by headless Chrome).

The spider runs, but I get banned as if no proxy were being used, and looking at the logs on the Chrome server it doesn't seem like the requests are received, so I'm guessing the middleware drops the requests, which go out like regular Scrapy requests.

Is there anything else I would need to update, apart from removing this below?
request.meta['download_slot'] = self.get_proxy_slot(proxy)
Here is my code:

def process_request(self, request, spider):
    if 'proxy' in request.meta and not request.meta.get('_rotating_proxy'):
        return
    proxy = self.proxies.get_random()
    if not proxy:
        if self.stop_if_no_proxies:
            raise CloseSpider("no_proxies")
        else:
            logger.warning("No proxies available; marking all proxies "
                           "as unchecked")
            self.proxies.reset()
            proxy = self.proxies.get_random()
            if proxy is None:
                logger.error("No proxies available even after a reset.")
                raise CloseSpider("no_proxies_after_reset")

    if request.meta['type'] == 'browserless':
        request.replace(body=json.dumps({'code': self.BROWSERLESS_EXEC_CODE,
                                         'context': {'url': request.url}}))
        request.replace(url=self.BROWSERLESS_URL + '&--proxy-server=' + proxy)
        request.replace(method='POST')
        request.replace(headers={'Cache-Control': 'no-cache',
                                 'Content-Type': 'application/json'})
    else:
        request.meta['proxy'] = proxy

    request.meta['_rotating_proxy'] = True
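One probable cause worth noting here: Scrapy's Request.replace does not modify the request in place, it returns a new Request, so each replace() call in the snippet above is discarded and the original request goes out unchanged. A downloader middleware has to return the replacement request from process_request, and Scrapy then reschedules it. A minimal illustration of the pattern, using a frozen-dataclass stand-in (FakeRequest and route_through_browserless are hypothetical names, not scrapy API):

```python
import json
from dataclasses import dataclass, replace as dc_replace

@dataclass(frozen=True)
class FakeRequest:
    """Hypothetical stand-in mimicking one trait of scrapy.Request:
    .replace() returns a NEW object and leaves the original untouched."""
    url: str
    method: str = 'GET'
    body: str = ''

    def replace(self, **kwargs):
        return dc_replace(self, **kwargs)

def route_through_browserless(request, browserless_url, proxy):
    """Chain the replace() calls into one and RETURN the result,
    instead of discarding it as the snippet above does."""
    return request.replace(
        url=browserless_url + '&--proxy-server=' + proxy,
        method='POST',
        body=json.dumps({'context': {'url': request.url}}),
    )
```

In the real middleware this means `return request.replace(...)` from process_request. Note that the returned request re-enters the middleware chain, so a guard such as the existing meta['_rotating_proxy'] flag is needed to avoid rewriting it a second time.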


Hi, I'm having the same need. I tried both options proposed:

  1. sub-class RotatingProxyMiddleware and implement the first option
  2. sub-class SplashMiddleware and implement the second option

Option 1 works better and is simpler to implement. Option 2 didn't seem to work well, since SplashMiddleware re-enters the middleware chain and a new proxy gets assigned every time, so changes in both RotatingProxyMiddleware and SplashMiddleware would be required to make it work.

@kmike Do you think making RotatingProxyMiddleware aware of SplashMiddleware would be something useful in scrapy-rotating-proxies? If so, I could try making a PR for it.

thanks

Hi, I can't get it to work. I tried this in rotating_proxies/middlewares.py:
request.meta['splash']['args']['proxy'] = proxy
#request.meta['download_slot'] = self.get_proxy_slot(proxy)
Am I right to assume we have to change the rest of the process_request function, and other functions as well?


@mxdev88 can you post the solution you ended up with? (I guess you went with the first option.) I have the same problem and I'd like to test your code, if possible.
Thanks


  1. settings.py

...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'xxxxxx.middlewares.SplashRotatingProxyDownloaderMiddleware': 611,
...

  2. create SplashRotatingProxyDownloaderMiddleware in middlewares.py

import logging

logger = logging.getLogger(__name__)

class SplashRotatingProxyDownloaderMiddleware:
    def process_request(self, request, spider):
        try:
            proxy = request.meta.get('proxy') if request.meta else None
            # 'args' is a dict, so use .get() here (getattr() on a dict
            # would always return the default)
            if proxy and 'splash' in request.meta and not request.meta['splash']['args'].get('proxy'):
                # proxy switch :)
                request.meta['splash']['args']['proxy'] = proxy
                request.meta['proxy'] = None
                logger.debug(f'Serving Splash proxy: {proxy}')
        except Exception as e:
            logger.exception(e)
        return None
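The shim's core move can be checked in isolation at the meta-dict level; move_proxy_into_splash_args is an illustrative name, not part of either library:

```python
# Dict-level sketch of what the shim middleware does: move the proxy
# that RotatingProxyMiddleware stored in meta['proxy'] into the Splash
# args, so Splash makes the proxied connection rather than Scrapy
# proxying its request to the Splash endpoint.

def move_proxy_into_splash_args(meta):
    proxy = meta.get('proxy')
    if proxy and 'splash' in meta and not meta['splash']['args'].get('proxy'):
        meta['splash']['args']['proxy'] = proxy
        meta['proxy'] = None  # mirror the shim: stop Scrapy-level proxying
    return meta
```

Requests without a 'splash' key pass through untouched, which is what lets the same spider mix Splash and plain requests.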