Development of these components has moved to https://github.com/ukwa/wren and related repositories.
An experiment aimed at building scalable, modular web archive components based on Docker containers.
- The warcprox Dockerfile sets up warcprox on Ubuntu 14.04/Python 3.4 with the necessary dependencies.
- The Squid caching forward proxy is used to set up a Cache Hierarchy, but instead of caching the results, the 'parent' proxies can be instances of warcprox.
- This should allow proxy-based web archiving to be used on large-scale crawls.
- Note that it may be possible to use the caching feature of Squid to avoid hitting the original site too often when extracting transcluded URLs.
- HAProxy in HTTP mode can redirect based on hdr(host), uri, etc. (but not in TCP mode).
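As a rough sketch of what "proxy-based web archiving" means for a client, the snippet below sends plain HTTP requests through a recording proxy using only the Python standard library. The proxy address `http://localhost:8000` and the helper names `build_archiving_opener` and `fetch_via_proxy` are assumptions made for illustration; substitute whatever address your warcprox, Squid or HAProxy front end actually listens on. HTTPS is left out because it would also require trusting the proxy's certificate authority.

```python
import urllib.request

# Assumption: a recording proxy (e.g. warcprox, or a front end that
# forwards to it) is reachable at this address.
PROXY_URL = "http://localhost:8000"


def build_archiving_opener(proxy_url=PROXY_URL):
    """Return an opener that routes plain HTTP requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, opener=None, timeout=30):
    """Fetch a URL through the proxy, so the proxy can record the exchange."""
    opener = opener or build_archiving_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.status, response.read()
```

Calling `fetch_via_proxy("http://example.org/")` while the containers are running should cause the request and response to pass through the proxy chain, where a warcprox instance can write them to a WARC.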
To experiment with scaling out, first clean out any existing machines:
$ docker-compose rm
Then define how many warcprox instances you want and ask for them to be configured:
$ docker-compose scale warcprox=3
Then run:
$ docker-compose up
The system will start up and configure an HAProxy instance that balances the load across all the warcprox instances. The provided configuration divides the load using hdr(host), which sends all requests relating to a particular host to the same warcprox instance. This ensures that URL-based de-duplication can work effectively. Further experimentation with the load balancing parameters is recommended.
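The effect of balancing on the Host header can be illustrated with a small Python sketch. This is not HAProxy's actual hashing algorithm, and the backend names are hypothetical; it only demonstrates the property that matters here, namely that every URL on a given host maps to the same instance, so that instance sees all earlier captures for that host.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical backend names standing in for the scaled warcprox containers.
BACKENDS = ["warcprox_1", "warcprox_2", "warcprox_3"]


def pick_backend(url, backends=BACKENDS):
    """Choose a backend from the URL's host only, ignoring path and query."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then take it
    # modulo the number of backends to get a stable index.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    for url in (
        "http://example.org/",
        "http://example.org/about?page=2",
        "http://example.com/",
    ):
        print(url, "->", pick_backend(url))
```

Note that with a simple modulo scheme like this one, changing the number of instances remaps most hosts, which is one reason the load balancing parameters are worth experimenting with.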
- Use a shared data volume container to hold the WARCs.
- Various web archiving components may benefit from having the CDX index as an independent, scalable service rather than the usual files.
- If the CDX server also presents an API for updating its index, as well as reading it, it can act as a core, standalone component in a modular architecture.
- Potential uses include: playback, de-duplication, 'last seen' state during crawls.
- The tinycdxserver Dockerfile sets up NLA's read/writable Remote Resource Index server (based on RocksDB) for experimentation.
- The read-only CDX servers (pywb, OpenWayback) could be unified and extended in this direction.
- Note that warcbase and OpenWayback can be used together for very large indexes that are best stored in HBase.
- Brozzler, an experimental distributed browser-based web crawler built on Docker, which works along similar lines.
- Various Dockerised OpenWayback images: LOCKSS, UNB Libraries, Sawood Alam.
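To make the idea of a read/write CDX service described above more concrete, here is a minimal in-memory sketch in Python of the operations such a service might expose. The class `TinyIndex` and its method names are hypothetical and are not the API of tinycdxserver, pywb or OpenWayback. Real CDX indexes also canonicalise URLs into sort-friendly keys, which is skipped here for brevity.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical in-memory stand-in for a read/write CDX service."""

    def __init__(self):
        # Maps url -> list of (timestamp, digest) captures, sorted by timestamp.
        # Timestamps are 14-digit strings (YYYYMMDDhhmmss), so string order
        # matches chronological order.
        self._captures = defaultdict(list)

    def add(self, url, timestamp, digest):
        """Write side: record a new capture (what a crawler would call)."""
        self._captures[url].append((timestamp, digest))
        self._captures[url].sort()

    def lookup(self, url):
        """Read side for playback: all captures of a URL, oldest first."""
        return list(self._captures.get(url, []))

    def last_seen(self, url):
        """'Last seen' state for crawls: newest timestamp, or None."""
        captures = self._captures.get(url)
        return captures[-1][0] if captures else None

    def is_duplicate(self, url, digest):
        """De-duplication: has this exact payload been captured for the URL?"""
        return any(d == digest for _, d in self._captures.get(url, []))
```

A crawler could call `is_duplicate` before writing a response record and `add` afterwards, while a playback tool would only need `lookup`. Putting the same operations behind an HTTP API is what would let them be shared between containers.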