test-spider

This spider is designed to work with the VMs provided in the scrapy-vagrant repo. The web VM in that repo should have an xkcd archive set up in /vagrant/web/html. The spider starts from the most recent page in the archive and follows the 'Previous' link on each page, scraping as it goes, until it reaches the end of the archive.
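The traversal described above can be sketched in plain Python, without Scrapy, to show the idea. This is not the spider's actual code: the in-memory `PAGES` dict, the `http://web/` URLs, the `rel="prev"` markup, and the `#` link on the oldest page are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PrevLinkParser(HTMLParser):
    """Collects the href of the first <a rel="prev"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.prev_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("rel") == "prev" and self.prev_href is None:
            self.prev_href = attr_map.get("href")


def crawl_backwards(pages, start_url):
    """Return the URLs visited, newest first, by following 'Previous' links."""
    visited = []
    url = start_url
    # Stop on an unknown page or on a loop back to a page already seen.
    while url in pages and url not in visited:
        visited.append(url)
        parser = PrevLinkParser()
        parser.feed(pages[url])
        # Assumed convention: the oldest page has no usable 'Previous' link.
        if not parser.prev_href or parser.prev_href == "#":
            break
        url = urljoin(url, parser.prev_href)
    return visited


# Hypothetical three-page archive held in memory instead of served by the web VM.
PAGES = {
    "http://web/3/": '<a rel="prev" href="/2/">&lt; Prev</a>',
    "http://web/2/": '<a rel="prev" href="/1/">&lt; Prev</a>',
    "http://web/1/": '<a rel="prev" href="#">&lt; Prev</a>',
}

order = crawl_backwards(PAGES, "http://web/3/")
```

In a Scrapy spider the same step would typically live in a parse callback that yields a new request for the previous page, with Scrapy handling the fetching and duplicate filtering.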

First-time setup

If you haven't already, clone the scrapy-vagrant repo and follow the setup instructions there. If you cloned it a while ago, pull the latest changes so you're up to date. Things may have changed recently, so check that repo's README for details.

On your local machine:

cd /path/to/scrapy-vagrant
cp /path/to/[xkcd archive] web
vagrant up web
vagrant ssh web

In the web VM:

cd /vagrant/web
tar zxvf [xkcd archive]

This places the archive's files in /vagrant/web/html, exactly where they need to be.

Usage

cd /path/to/scrapy-vagrant
vagrant up scrapy-vm web
vagrant ssh scrapy-vm
cd /vagrant/scrapy-vagrant/test-spider
scrapy crawl xkcd

This starts the spider, and you can watch it work. Cached responses are saved under the .scrapy subdirectory; the exact location within it depends on the cache settings described under Configuration.

Configuration

Settings for this spider are in xkcd_test/settings.py. Comments near the end of that file explain how to experiment with the HttpCacheMiddleware options.
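As a rough guide, the HTTP cache settings that Scrapy exposes look like the fragment below. The values are illustrative assumptions, not the ones in this repo's settings.py, and the module paths shown apply to Scrapy 1.0 and later (older releases used scrapy.contrib.httpcache instead).

```python
# Sketch of HTTP cache settings for a Scrapy settings.py (values are illustrative).

# Turn on HttpCacheMiddleware so responses are stored and replayed.
HTTPCACHE_ENABLED = True

# Relative paths are resolved inside the project's .scrapy data directory.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Storage backend and cache policy; these module paths assume Scrapy 1.0+.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
```

Changing HTTPCACHE_DIR or the storage backend changes where and how the cache is written under .scrapy, which is a simple way to compare configurations side by side.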
