Arachne is meant to be a next generation version of hiispider, a flexible web spider written at hiidef for flavors.me. It features a very similar high level architecture, but implements them differently to achieve a few important objectives:
- HTTP interfaces should be rich and easily extendable
- Plugins should be easy to run synchronously
- DRY-ness in the resultant plugin code
- Should not depend on undocumented architectural decisions
Asynchronicity is achieved with gevent, which should be patched by users of arachne. Without patching, arachne behaves synchronously and nearly all of its clients and libraries are usable from the python shell.
Arachne is split up into 3 major pieces:
- A
scheduler
which puts jobs on a queue - A
worker
which executes scheduled jobs - An
interface
which runs jobs on demand via HTTP
Jobs are all tied to methods implemented in plugins. Arachne makes certain basic assumptions and decisions, and will take care of these problems:
- Mapping URLs to plugin methods
- Basic plugin execution and result storage
- Registration and lookup for available plugins
- Associating a run-interval (every n seconds) with each plugin method
- Daemonization, start/stop/restart & pidfiles
You will have to decide:
- What a "job" looks like coming on and off the queue
- Where and how to store plugin results
- How to schedule those jobs
- How to store data necessary to run the jobs
Arachne comes with a number of batteries included:
- a simple no-magic configuration management system
- a rich http library, based on requests with:
- header caching on a pluggable backend (eg. memcached)
- header-based json/xml parsing with forced overrides
- OAuth 1.0a helpers (via requests-oauth)
- alternate
session
style helpers w/ with base-url support
- a memcached wrapper based on ultramemcache
- a mysql wrapper based on ultramysql
- an AMQP client based on kombu and amqplib
- a cassandra client based on pycassa
All of these clients will attempt to auto-configure with arachne's configuration management system.