A simple, stable, scalable and swift general web crawler.
- Simple: fewer than 5 core classes.
- Swift: 100+ QPS (requests per second) on average.
- Pluggable Dynamic Proxy Pool: choose any proxy pool you like; only a few lines of code are needed.
- Adaptive Traffic Control: Easy Crawler throttles the scraping speed to match your settings.
- High Scalability: easy to deploy as distributed crawlers.
- Python 3+
- Redis Server
<Easy Crawler> ---- (1) get proxy IP ---- <Proxy Pool>
      |
      |
(2) pull and (3) push tasks
      |
      |
<Redis queue> ($REDIS_HOST:$REDIS_PORT)
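To make the three steps above concrete, here is a minimal, purely illustrative loop in plain Python. It is not Easy Crawler's actual code; the proxy-pool endpoint, the `task_queue` key and the JSON format are assumptions.

# Illustrative sketch of the flow above; this is NOT Easy Crawler's real code.
# The proxy-pool endpoint, the "task_queue" key and the JSON format are assumptions.
import os

import redis
import requests

r = redis.Redis(host=os.getenv("REDIS_HOST", "127.0.0.1"),
                port=int(os.getenv("REDIS_PORT", "6379")))

def crawl_once():
    # (1) get a proxy IP from the proxy pool (hypothetical HTTP endpoint)
    proxy = requests.get("http://127.0.0.1:5010/get", timeout=5).json().get("proxy")

    # (2) pull one task (here simply a URL) from the shared Redis queue
    url = r.lpop("task_queue")
    if url is None:
        return

    resp = requests.get(url.decode(),
                        proxies={"http": f"http://{proxy}"} if proxy else None,
                        timeout=10)

    # (3) push newly discovered tasks back onto the same queue
    for new_url in []:  # replace [] with links parsed from resp.text
        r.rpush("task_queue", new_url)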
# 1. Install requirements:
pip install -r requirements.txt
# 2. Copy the .env:
cp .env.example .env
# 3. Modify the .env (a sketch of its contents follows these commands):
vi .env
# 4. Start redis server
sudo apt install redis-server # If you haven't installed redis-server
redis-server
# 5. Start proxy pool server
# This example doesn't need a proxy pool server; it uses a fake proxy pool. For a crawler with a real proxy pool, see `Built-in Proxy Pool` below.
# 6. Run the Minimal Crawler
# HINT: You must run these scripts in the root directory of REPO
# linux or mac
export PYTHONPATH=. && python crawlers/simple_crawler.py
# windows
set PYTHONPATH=.
python crawlers/simple_crawler.py
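For step 3, a minimal sketch of what `.env` might contain, using only the variable names that appear elsewhere in this README; every value is a placeholder, not a recommended default:

# .env (illustrative placeholders; adjust to your setup)
REDIS_HOST=127.0.0.1
REDIS_PORT=6379
# Ports of the optional built-in proxy pool servers (see Built-in Proxy Pool)
JHAO104_PORT=5010
KARMEN_PORT=5011
SCYLLA_PORT=5012
CJDX_PORT=5013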
- Run the `simple_crawler` example in the Quick Start to check that the custom crawler foundation works.
- `cp crawlers/simple_crawler.py crawlers/YOUR_crawler.py`
- Modify the code in `YOUR_crawler.py` after reading the interface comments carefully (a rough sketch of the overall shape follows the run commands below).
- Run and enjoy your own crawler.
# linux or mac
export PYTHONPATH=. && python crawlers/YOUR_crawler.py
# windows
set PYTHONPATH=.
python crawlers/YOUR_crawler.py
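The real interface is documented in the comments of `simple_crawler.py`; the sketch below only illustrates the rough shape of a custom crawler. The `Crawler` import path and the `init_tasks` / `parse` hook names are assumptions rather than Easy Crawler's actual API; only the `start(...)` call mirrors the usage shown later in this README.

# crawlers/YOUR_crawler.py -- illustrative sketch only.
# The Crawler import path and the init_tasks/parse hooks are assumed names;
# read the interface comments in crawlers/simple_crawler.py for the real ones.
from core.crawler import Crawler  # assumed module path


class YOUR_Crawler(Crawler):
    def init_tasks(self):
        # Seed the shared Redis job queue with the first tasks to scrape.
        return ["https://example.com/page/1"]

    def parse(self, task, response):
        # Extract your data from `response` and return any follow-up tasks
        # to be pushed back onto the queue.
        return []


if __name__ == "__main__":
    YOUR_Crawler.start(
        task_name="YOUR_crawler",  # name of the shared job queue in Redis
        proxy_pool="fake",         # built-in "no proxy" pool; see Built-in Proxy Pool
    )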
- Install the proxy pool servers according to the guidance in their repositories.
- Set the corresponding port in `.env`.

Available built-in pools:

- Proxy Name: `haip`
- Proxy Name: `jhao104`, Port: `$JHAO104_PORT`
- Proxy Name: `karmenzind`, Port: `$KARMEN_PORT`
- Proxy Name: `scylla`, Port: `$SCYLLA_PORT`
- Proxy Name: `scylla`, Port: `$CJDX_PORT`
- Proxy Name: `mixed`, Port: `$JHAO104_PORT`, `$KARMEN_PORT`, `$SCYLLA_PORT` (mixes all of the above pools together)
- Proxy Name: `fake` (uses no proxy)
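For example, assuming the same `start()` entry point shown in the custom proxy pool example further below, a crawler switches to a built-in pool simply by passing its proxy name (fragment, not a complete script):

# Fragment: pick a built-in pool by its proxy name.
YOUR_Crawler.start(
    task_name="YOUR_crawler",
    proxy_pool="jhao104",  # or haip, karmenzind, scylla, mixed, fake
)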
- Create a `YOUR_PROXY_POOL.py` in `proxy_pools`.
- Add a `YOUR_PROXY_POOL` class, which should extend `core.proxy_pool.ProxyPool`. Don't forget to add a `@register_proxy_pool("YOUR_PROXY_POOL_NAME")` decorator to your class.
- Implement `collect_proxies`. You can override `get_proxy` and `feedback_proxy` if necessary (see the sketch after the run example below).
- Run your crawler with `proxy_pool="YOUR_PROXY_POOL_NAME"`:
# crawlers/YOUR_crawler.py
if __name__ == "__main__":
YOUR_Crawler.start(
task_name="YOUR_crawler",
proxy_pool="YOUR_PROXY_POOL_NAME",
...
)
# Run `python crawlers/YOUR_crawler.py` to test your proxy pool.
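A minimal sketch of what `proxy_pools/YOUR_PROXY_POOL.py` might look like. The class, decorator and method names come from the steps above; the import location of `register_proxy_pool`, the method signatures and the proxy source URL are assumptions.

# proxy_pools/YOUR_PROXY_POOL.py -- illustrative sketch.
# Only the class/decorator/method names come from the steps above; the import
# location, signatures and proxy source URL are assumptions.
import requests

from core.proxy_pool import ProxyPool, register_proxy_pool  # import path assumed


@register_proxy_pool("YOUR_PROXY_POOL_NAME")
class YOUR_PROXY_POOL(ProxyPool):
    def collect_proxies(self):
        # Fetch proxies from your own source and return them,
        # e.g. as "ip:port" strings (hypothetical endpoint and format).
        resp = requests.get("http://your-proxy-source.example/api/proxies", timeout=10)
        return resp.json().get("proxies", [])

    # Optionally override get_proxy / feedback_proxy here if the default
    # selection and feedback behaviour does not fit your pool.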
Just run crawlers with the same `task_name` in each container. They will share the job queue in Redis.
Todo.
Yi Ren (RayeRen)