Scrapydd is a distributed running and scheduling system for Scrapy spiders, consisting of a server and client agents.
- Distributed: easily add runners (agents) to scale out.
- Project requirements are installed automatically on demand.
- Cron-expression time triggers (e.g. `0 12 * * *` for daily at noon) run your spiders on schedule.
- Webhooks loosely decouple data crawling from data processing; see the sketch after this list.
- Spider status insight: the system inspects run logs to determine each spider run's status.
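To illustrate the webhook side, here is a minimal sketch of a receiver that your data processing could live behind. The port, path, and JSON item payload are assumptions for illustration, not scrapydd's documented webhook contract.

# Minimal webhook receiver sketch (Python stdlib only).
# Assumes items arrive as JSON POST bodies; the payload shape is hypothetical.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ItemHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        item = json.loads(self.rfile.read(length))
        print("received item:", item)  # hand off to your own processing here
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), ItemHandler).serve_forever()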
Install via pip:
pip install scrapydd
You can also install scrapydd manually:
- Download the compressed package from the GitHub releases page.
- Decompress the package.
- Run:
python setup.py install
scrapydd server
The server serves on 0.0.0.0:6800 by default, with both the API and the web UI. Add the --daemon parameter on the command line to run it in the background.
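For example, to start the server in the background:

scrapydd server --daemon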
scrapydd agent
Add the --daemon parameter on the command line to run it in the background.
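An agent on a different machine needs to be pointed at the server. The docker-compose example below does this with the SCRAPYDD_SERVER environment variable; assuming that variable is also honored outside Docker (an assumption, not confirmed here), an agent could be started like this, where your-server-host is a placeholder:

SCRAPYDD_SERVER=your-server-host scrapydd agent --daemon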
The docs are hosted here.
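The following docker-compose file runs a MySQL database, the scrapydd server, and one agent together: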
version: '3'
services:
  db:
    image: mysql
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    ports:
      - "3306:3306"
    volumes:
      - "datadb:/var/lib/mysql"
    environment:
      MYSQL_ROOT_PASSWORD: mysqlPassword
      MYSQL_DATABASE: scrapydd
      MYSQL_USER: scrapydd
      MYSQL_PASSWORD: scrapyddPwd
  server:
    image: "kevenli/scrapydd"
    ports:
      - "6800:6800"
    volumes:
      - "./server:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    command: scrapydd server
  agent:
    image: "kevenli/scrapydd"
    volumes:
      - "./agent:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    links:
      - server
    environment:
      - SCRAPYDD_SERVER=server
    command: scrapydd agent
volumes:
  datadb:
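Save the file as docker-compose.yml and bring the stack up with:

docker-compose up -d

Because agents can be added to scale out, more agent containers can be started with, for example, docker-compose up -d --scale agent=3.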