This spider is designed to work with the VMs provided in the scrapy-vagrant
repo. The web VM in that repo should have an xkcd archive set up in /vagrant/web/html.
The spider starts from the most recent page in the archive and follows the
'Previous' links, scraping each page, until it reaches the end of the archive.
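The traversal described above can be sketched with the standard library alone. This is not the spider's actual code: the names `PrevLinkParser`, `find_previous`, and `crawl_backwards` are hypothetical, and the sketch assumes the 'Previous' link is an `<a>` element whose text contains "Prev" and that the oldest page points it at `#`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PrevLinkParser(HTMLParser):
    """Records the href of the first <a> whose text contains 'Prev'."""

    def __init__(self):
        super().__init__()
        self._open_href = None  # href of the <a> tag currently being read
        self.prev_href = None   # href of the first 'Prev' link found

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_href = None

    def handle_data(self, data):
        if self._open_href and self.prev_href is None and "Prev" in data:
            self.prev_href = self._open_href


def find_previous(html, page_url):
    """Return the absolute URL of the 'Previous' page, or None at the end."""
    parser = PrevLinkParser()
    parser.feed(html)
    parser.close()
    href = parser.prev_href
    # Assumption: a missing link or a bare '#' marks the oldest page.
    if href is None or href.startswith("#"):
        return None
    return urljoin(page_url, href)


def crawl_backwards(start_url, fetch):
    """Visit start_url, then each 'Previous' page, until no link remains.

    `fetch` is any callable that maps a URL to that page's HTML.
    """
    visited = []
    url = start_url
    while url is not None and url not in visited:
        visited.append(url)
        url = find_previous(fetch(url), url)
    return visited
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded, so it can be tried against a dictionary of sample pages before pointing it at the web VM.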
If you haven't already, clone the scrapy-vagrant repo and follow the setup instructions there. If you cloned it a while ago, pull the latest changes so you're up to date. The setup may have changed since then, so check that repo's README for details.
On your local machine:
cd /path/to/scrapy-vagrant
cp /path/to/[xkcd archive] web
vagrant up web
vagrant ssh web
In the web VM:
cd /vagrant/web
tar zxvf [xkcd archive]
This places the contents of the archive in /vagrant/web/html, exactly where they need to be.
To run the spider, on your local machine:
cd /path/to/scrapy-vagrant
vagrant up scrapy-vm web
vagrant ssh scrapy-vm
In the scrapy-vm VM:
cd /vagrant/scrapy-vagrant/test-spider
scrapy crawl xkcd
This starts the spider, and you can watch it work. The cache is saved under the .scrapy
directory, in a subdirectory that depends on the config options.
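To see which cache subdirectories a run produced, a small helper like the one below can be run from the project directory. It is only a sketch: `summarize_cache` is a hypothetical name, and the default `.scrapy` path assumes the script is started from the directory that contains it.

```python
from pathlib import Path


def summarize_cache(root=".scrapy"):
    """Map each top-level subdirectory of `root` to the number of files under it."""
    root = Path(root)
    if not root.is_dir():
        return {}
    return {
        child.name: sum(1 for p in child.rglob("*") if p.is_file())
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


if __name__ == "__main__":
    for name, count in summarize_cache().items():
        print(f"{name}: {count} files")
```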
Configuration for this spider lives in xkcd_test/settings.py. The comments near
the end of that file explain how to experiment with the settings for HTTPCacheMiddleware.
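For orientation, the fragment below shows the kind of HTTP cache settings Scrapy reads from a settings module. The values are illustrative and are not copied from xkcd_test/settings.py. The class paths assume a recent Scrapy release; older releases kept these classes under `scrapy.contrib.httpcache`.

```python
# Illustrative HTTP cache settings (assumed values, not the project's own).

# Turn the HTTP cache middleware on.
HTTPCACHE_ENABLED = True

# Cache directory; a relative path is resolved inside the project's .scrapy directory.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# DummyPolicy caches every response; RFC2616Policy honours HTTP caching headers.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"

# Store each cached response as files on disk.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemStorage"

# Response status codes that should not be cached.
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
```

Changing `HTTPCACHE_DIR` between runs is one way to keep the caches from different configurations apart.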