Install scrapy-proxy-pool with pip:

```
pip install scrapy_proxy_pool
```
Enable this middleware by adding the following settings to your settings.py:
```python
PROXY_POOL_ENABLED = True
```
Then add the scrapy_proxy_pool middlewares to your DOWNLOADER_MIDDLEWARES:

```python
DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}
```
After this, all requests will be proxied through proxies from the pool.

Requests with `"proxy"` set in their meta are not handled by scrapy-proxy-pool. To disable proxying for a request, set `request.meta['proxy'] = None`; to set a proxy explicitly, use `request.meta['proxy'] = "<my-proxy-address>"`.
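A minimal sketch of this per-request control (the spider name, URLs, and proxy address below are hypothetical placeholders):

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # hypothetical spider for illustration

    def start_requests(self):
        # Routed through a proxy chosen by scrapy-proxy-pool:
        yield scrapy.Request('https://example.com/pooled')
        # Proxying disabled for this request:
        yield scrapy.Request('https://example.com/direct',
                             meta={'proxy': None})
        # Explicit proxy; not handled by scrapy-proxy-pool:
        yield scrapy.Request('https://example.com/pinned',
                             meta={'proxy': 'http://203.0.113.1:8080'})
```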
By default, all of Scrapy's standard concurrency options (`DOWNLOAD_DELAY`, `AUTOTHROTTLE_...`, `CONCURRENT_REQUESTS_PER_DOMAIN`, etc.) become per-proxy for proxied requests when ProxyPoolMiddleware is enabled. For example, if you set `CONCURRENT_REQUESTS_PER_DOMAIN = 2`, the spider will make at most 2 concurrent connections to each proxy, regardless of the request's URL domain.
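As a concrete sketch, a settings.py fragment under this behaviour might look like the following; both values are illustrative:

```python
# settings.py -- illustrative values
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # at most 2 concurrent connections per proxy
DOWNLOAD_DELAY = 0.5                # the delay is likewise applied per proxy
```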
scrapy-proxy-pool keeps track of working and non-working proxies, and re-checks the non-working ones from time to time.

Detection of a non-working proxy is site-specific. By default, scrapy-proxy-pool uses a simple heuristic: if a response status code is not 200, 301, 302, 404, or 500, if the response body is empty, or if an exception was raised, then the proxy is considered dead.
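In rough Python terms, the default heuristic behaves like the sketch below; this is a paraphrase for illustration, not the library's actual source:

```python
# Paraphrase of the default ban heuristic, not the library's actual code.
GOOD_STATUSES = {200, 301, 302, 404, 500}

def proxy_is_dead(response=None, exception=None):
    if exception is not None:
        return True                           # any exception -> proxy is dead
    if response.status not in GOOD_STATUSES:
        return True                           # unexpected status -> proxy is dead
    if not response.body:
        return True                           # empty body -> proxy is dead
    return False
```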
You can override the ban detection method by passing a path to a custom BanDetectionPolicy in the PROXY_POOL_BAN_POLICY option, e.g.:

```python
# settings.py
PROXY_POOL_BAN_POLICY = 'myproject.policy.MyBanPolicy'
```
The policy must be a class with `response_is_ban` and `exception_is_ban` methods. These methods can return True (ban detected), False (not a ban) or None (unknown). It can be convenient to subclass and modify the default BanDetectionPolicy:

```python
# myproject/policy.py
from scrapy_proxy_pool.policy import BanDetectionPolicy

class MyPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # use default rules, but also consider an HTTP 200 response
        # a ban if the word 'captcha' appears in the response body.
        ban = super(MyPolicy, self).response_is_ban(request, response)
        ban = ban or b'captcha' in response.body
        return ban

    def exception_is_ban(self, request, exception):
        # override the method completely: don't take exceptions into account
        return None
```
Instead of creating a policy you can also implement `response_is_ban` and `exception_is_ban` methods directly on the spider, for example:

```python
import scrapy

class MySpider(scrapy.Spider):
    # ...

    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None
```
It is important to get these rules right, because the actions for a failed page and for a bad proxy should differ: if the proxy is to blame, it makes sense to retry the request with a different proxy.
- `PROXY_POOL_ENABLED` - whether to enable ProxyPoolMiddleware;
- `PROXY_POOL_FILTER_ANONYMOUS` - whether to use anonymous proxies, False by default;
- `PROXY_POOL_FILTER_TYPES` - which proxy types to use; only 'http' and 'https' are available. ['http', 'https'] by default;
- `PROXY_POOL_FILTER_CODE` - which proxy country code to use. 'us' by default;
- `PROXY_POOL_REFRESH_INTERVAL` - proxy refresh interval in seconds, 900 by default;
- `PROXY_POOL_LOGSTATS_INTERVAL` - stats logging interval in seconds, 30 by default;
- `PROXY_POOL_CLOSE_SPIDER` - when True, the spider is stopped if there are no alive proxies. If False (default), all dead proxies are re-checked when no alive proxies remain;
- `PROXY_POOL_FORCE_REFRESH` - when True, the spider will force-refresh the proxy list if there are no alive proxies. If False (default), requests are sent from the host IP when no alive proxies remain;
- `PROXY_POOL_PAGE_RETRY_TIMES` - the number of times to retry downloading a page using a different proxy. After this many retries, the failure is considered a page failure, not a proxy failure. Think of it this way: every improperly detected ban costs you `PROXY_POOL_PAGE_RETRY_TIMES` alive proxies. Default: 5. You can change this option per-request using the `max_proxies_to_try` request.meta key; for example, you can use a higher value for certain pages if you're sure they should work. See the configuration sketch after this list;
- `PROXY_POOL_TRY_WITH_HOST` - when True, requests that exceed `PROXY_POOL_PAGE_RETRY_TIMES` are retried from the host IP;
- `PROXY_POOL_BAN_POLICY` - path to a ban detection policy. Default is `'scrapy_proxy_pool.policy.BanDetectionPolicy'`.
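To illustrate, here is a sketch of a settings.py fragment combining several of these options; the values are illustrative choices, not defaults:

```python
# settings.py -- illustrative values, not the defaults
PROXY_POOL_ENABLED = True
PROXY_POOL_FILTER_TYPES = ['https']   # only use HTTPS proxies
PROXY_POOL_PAGE_RETRY_TIMES = 10      # allow more proxy retries per page
PROXY_POOL_BAN_POLICY = 'myproject.policy.MyBanPolicy'
```

In a spider, the retry budget can then be raised for an individual page via the `max_proxies_to_try` meta key, e.g. `yield scrapy.Request(url, meta={'max_proxies_to_try': 20})`.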