
A package for supporting proxy in Scrapy & Gerapy

Gerapy Proxy

This package adds proxy support to Scrapy using an async mechanism. It is also a module in Gerapy.


pip3 install gerapy-proxy


If you have a ProxyPool that can provide a random proxy for every request, you can use this package to integrate proxies into your Scrapy/Gerapy project.

For example, if a ProxyPool API returns a random proxy on each call, you can configure the GERAPY_PROXY_POOL_URL setting provided by this package to enable a proxy for every Scrapy Request.

To use this package, first install it, then enable the middleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_proxy.middlewares.ProxyPoolMiddleware': 543,
}

Then add the ProxyPool URL to your settings:

GERAPY_PROXY_POOL_URL = 'https://proxypool.scrape.center/random'

This ProxyPool instance is built from the ProxyPool repo. You can also build your own ProxyPool service.

That completes the setup.

The ProxyPoolMiddleware first fetches a proxy from GERAPY_PROXY_POOL_URL and then sets the meta.proxy attribute on the Scrapy Request.
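
As a rough illustration of that last step, the sketch below shows how a fetched `host:port` string could be attached to a request's meta dictionary. The helper name `attach_proxy`, the default `http://` scheme, and the plain dict standing in for a Scrapy Request's `meta` are assumptions made for illustration, not the package's actual code.

```python
def attach_proxy(meta, proxy_text, scheme="http"):
    """Store a proxy URL under meta['proxy'], the key Scrapy reads for proxying.

    Hypothetical helper: `meta` is a plain dict standing in for request.meta.
    """
    proxy = proxy_text.strip()
    if not proxy:
        # Nothing usable came back from the pool, so leave meta untouched.
        return meta
    if "://" not in proxy:
        # Assumption: the pool returns bare "host:port", so a scheme is added.
        proxy = f"{scheme}://{proxy}"
    meta["proxy"] = proxy
    return meta


# 127.0.0.1:3128 is a placeholder address, not a real pool response.
print(attach_proxy({}, "127.0.0.1:3128\n"))
```

An empty response leaves the request untouched, so it would go out without a proxy rather than fail.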


Basic Auth

If your ProxyPool has Basic Auth, you can enable it by configuring these settings:


Min Retry Times

If you want to enable the proxy based on the number of retries, you can configure this setting:


The proxy will then be used only when the Request's retry count is greater than or equal to 2.
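
The rule above can be sketched as a small predicate. Scrapy's retry middleware records the retry count under `retry_times` in the request meta. The function name `proxy_allowed_by_retries` and the plain dict standing in for `request.meta` are assumptions for illustration, not the package's own code.

```python
def proxy_allowed_by_retries(meta, min_retry_times):
    """Return True once the request has been retried at least min_retry_times."""
    # A request that has never been retried has no 'retry_times' key yet.
    retry_times = meta.get("retry_times", 0)
    return retry_times >= min_retry_times


print(proxy_allowed_by_retries({}, 2))                  # first attempt
print(proxy_allowed_by_retries({"retry_times": 2}, 2))  # second retry
```

With a threshold of 2, the first attempt and the first retry go out directly, and the proxy is used from the second retry onward.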

Random Enabled

If you want to enable the proxy randomly, you can configure the probability of enabling it:


The probability of enabling the proxy is then 80%. If you set it to 1, the proxy is always enabled.
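
One common way to implement such a probability switch is to compare a uniform random number against the configured rate. The sketch below assumes that approach. The function name `proxy_enabled_randomly` and the `rng` parameter are hypothetical and not taken from the package.

```python
import random


def proxy_enabled_randomly(rate, rng=random):
    """Return True with probability `rate` (0 disables, 1 always enables)."""
    # rng.random() yields a float in [0.0, 1.0), so rate=1 always passes
    # and rate=0 never does.
    return rng.random() < rate


# A seeded generator makes the demonstration repeatable.
rng = random.Random(0)
hits = sum(proxy_enabled_randomly(0.8, rng) for _ in range(10_000))
print(f"proxy enabled for {hits / 10_000:.0%} of simulated requests")
```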

Fetch Timeout

You can also configure the maximum time allowed for fetching a proxy from the ProxyPool:


With this configured, if the ProxyPool does not return a result within 5 seconds, the proxy is not used.
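
The fallback behaviour described above can be sketched with `asyncio.wait_for`, which cancels the awaited call once the time limit passes. The names `fetch_with_timeout`, `fast_pool`, and `slow_pool` are hypothetical stand-ins, and the short delays replace a real HTTP call so the example runs offline. This is a sketch of the idea, not the package's implementation.

```python
import asyncio


async def fetch_with_timeout(fetch, timeout):
    """Await fetch() and return its result, or None if it exceeds `timeout` seconds."""
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout)
    except asyncio.TimeoutError:
        # The pool was too slow: signal "no proxy" so the request goes out directly.
        return None


async def fast_pool():
    # Stand-in for a pool that answers immediately (placeholder address).
    return "127.0.0.1:3128"


async def slow_pool():
    # Stand-in for a pool that answers after the deadline.
    await asyncio.sleep(0.2)
    return "127.0.0.1:3128"


print(asyncio.run(fetch_with_timeout(fast_pool, timeout=0.05)))
print(asyncio.run(fetch_with_timeout(slow_pool, timeout=0.05)))
```

Returning `None` instead of raising keeps the timeout from turning into a failed request.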

ProxyPool Response Parser

Your ProxyPool may not return the proxy as plain text. In that case, you can define a parser to extract the proxy from its response.

For example, if your ProxyPool returns this for every request:

{
  "host": "",
  "port": 3128
}

You can define a method like:

import json


def parse_result(text):
    """Convert the pool's JSON response into a 'host:port' string."""
    data = json.loads(text)
    return f'{data.get("host")}:{data.get("port")}'

Then you will get the proxy in the correct format.
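
To check a parser like this before wiring it in, you can call it on a sample response body. In the sketch below, the host `127.0.0.1` is a placeholder, and `parse_result_strict` is a hypothetical variant added only to show one way of rejecting incomplete responses.

```python
import json


def parse_result(text):
    """Convert the pool's JSON response into a 'host:port' string."""
    data = json.loads(text)
    return f'{data.get("host")}:{data.get("port")}'


def parse_result_strict(text):
    """Hypothetical variant: raise instead of returning a malformed proxy."""
    data = json.loads(text)
    host, port = data.get("host"), data.get("port")
    if not host or not port:
        raise ValueError(f"incomplete proxy response: {text!r}")
    return f"{host}:{port}"


sample = '{"host": "127.0.0.1", "port": 3128}'
print(parse_result(sample))         # 127.0.0.1:3128
print(parse_result_strict(sample))  # 127.0.0.1:3128
```

The lenient version turns a response without a `host` key into the string `None:3128`, so the strict variant may be the safer choice when the pool's output is not fully trusted.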


For more details, please see the example.

You can also run the example directly with Docker:

docker run germey/gerapy-proxy-example


2020-07-15 19:17:34 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-15 19:17:34 [scrapy.utils.log] INFO: Versions: lxml, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-15 19:17:34 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-15 19:17:34 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'NEWSPIDER_MODULE': 'example.spiders',
 'SPIDER_MODULES': ['example.spiders']}
2020-07-15 19:17:34 [scrapy.extensions.telnet] INFO: Telnet Password: 33299ca0ce64f215
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled extensions:
2020-07-15 19:17:34 [asyncio] DEBUG: Using selector: KqueueSelector
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled spider middlewares:
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled item pipelines:
2020-07-15 19:17:34 [scrapy.core.engine] INFO: Spider opened
2020-07-15 19:17:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-15 19:17:34 [scrapy.extensions.telnet] INFO: Telnet console listening on
2020-07-15 19:17:34 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:34 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy
2020-07-15 19:17:40 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://httpbin.org/delay/3> (referer: None)
2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:40 [example.spiders.httpbin] INFO: got request from successfully, current page 1
2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: get proxy
2020-07-15 19:17:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://httpbin.org/delay/3> (failed 1 times): User timeout caused connection failure: Getting https://httpbin.org/delay/3 took longer than 10.0 seconds..
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://httpbin.org/delay/3> (failed 1 times): User timeout caused connection failure: Getting https://httpbin.org/delay/3 took longer than 10.0 seconds..
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy