Primary language: Python. License: Apache License 2.0 (Apache-2.0).

ScrapyDD (Scrapy Distributed Daemon)

Scrapydd is a system for running and scheduling Scrapy spiders in a distributed fashion. It consists of a server and a client agent.

Advantages:

  • Distributed: add runners (agents) to scale out easily.
  • Project requirements are installed automatically on demand.
  • Time-driven triggers based on cron expressions run your spider on schedule.
  • Webhooks loosely couple data crawling and data processing.
  • Spider status insight: the system inspects the log to determine the status of each spider run.
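
To make the cron-driven trigger concrete, here is a minimal sketch of how a cron expression can be checked against a point in time. It assumes the common five-field layout (minute, hour, day of month, month, day of week, with 0 meaning Sunday). Scrapydd's actual parser may differ, and the helper names field_matches and cron_matches are hypothetical, not part of Scrapydd.

```python
from datetime import datetime


def field_matches(field: str, value: int, minimum: int) -> bool:
    """Return True if one cron field (e.g. "*/15", "1-5", "0,30") accepts value."""
    for part in field.split(","):
        step = 1
        has_step = "/" in part
        if has_step:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = minimum, None
        elif "-" in part:
            low, high = part.split("-")
            start, end = int(low), int(high)
        else:
            start = int(part)
            # "5/15" means "from 5 upward, every 15"; a bare "5" means exactly 5.
            end = None if has_step else start
        if value < start or (end is not None and value > end):
            continue
        if (value - start) % step == 0:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Check a five-field cron expression against a datetime (assumed layout)."""
    minute, hour, day, month, weekday = expression.split()
    # Python counts Monday as 0; this sketch assumes cron's Sunday = 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        field_matches(minute, moment.minute, 0)
        and field_matches(hour, moment.hour, 0)
        and field_matches(day, moment.day, 1)
        and field_matches(month, moment.month, 1)
        and field_matches(weekday, cron_weekday, 0)
    )
```

Under these assumptions, "*/15 * * * *" matches every quarter hour and "0 2 * * 1" matches 02:00 on Mondays. The sketch requires all five fields to match, which is simpler than the rule some cron implementations apply when both day fields are restricted.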

Installing Scrapydd

With pip:

pip install scrapydd

You can also install scrapydd manually:

  1. Download the compressed package from the GitHub releases page.
  2. Decompress the package.
  3. Run python setup.py install.

Run Scrapydd Server

scrapydd server

By default, the server listens on 0.0.0.0:6800 and serves both the API and the web UI. Add the --daemon parameter on the command line to run it in the background.
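
After starting the server, especially in the background, it can help to confirm that something is accepting connections on the default port. The following is a minimal sketch using only the Python standard library. The helper name is_port_open is hypothetical, and the check only tests TCP reachability; it does not verify that the listener is actually Scrapydd.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection opens and connects a TCP socket in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, is_port_open("127.0.0.1", 6800) should return True once a server on the same machine is listening on the default port.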

Run Scrapydd Agent

scrapydd agent

Add the --daemon parameter on the command line to run the agent in the background.

Docs

The docs are hosted here.

Docker-Compose

version: '3'
services:
  db:
    image: mysql
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    ports:
      - "3306:3306"
    volumes:
      - "datadb:/var/lib/mysql"
    environment:
      MYSQL_ROOT_PASSWORD: mysqlPassword
      MYSQL_DATABASE: scrapydd
      MYSQL_USER: scrapydd
      MYSQL_PASSWORD: scrapyddPwd

  server:
    image: "kevenli/scrapydd"
    ports:
      - "6800:6800"
    volumes:
      - "./server:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    command: scrapydd server

  agent:
    image: "kevenli/scrapydd"
    volumes:
      - "./agent:/scrapydd"
      - "/var/run/docker.sock:/var/run/docker.sock"
    links:
      - server
    environment:
      - SCRAPYDD_SERVER=server
    command: scrapydd agent

volumes:
  datadb: