NB: this repository was migrated from Bitbucket. Issues were migrated by hand; pull request comments have not been migrated.
SSH into a fresh dataset, then run the following commands:
mkdir ~/BAK && mv ~/tool ~/incoming ~/http ~/BAK
git clone git@github.com:scraperwiki/data-services-scraper-template.git
./data-services-scraper-template/install.sh .
Now you have a brand new local git repo at the top level. It is safe to remove the template repo with this command:
rm -rf ./data-services-scraper-template
Now head over to GitHub and create a new private repository for this scraper:
Select "I have an existing project to push up".
This will give you a 'git remote' command for linking the local repo to the new remote repo - you should run this now.
Back in your box, edit ./README.md with the URL of the new repo in GitHub.
Now do your initial commit and push to GitHub:
git status # check what will go into the initial commit
git add -A && git commit -m "Initial commit" # skip if install.sh has already committed the skeleton
git push -u origin --all # to push up the repo for the first time
Finally, activate your scraper (enable crontab, create virtualenv etc) by running the following command:
./tool/first_run.sh
The entry point for your code is tool/main.py. If you're scraping a single type of data, you should probably keep all the code in that one file.
Sample data lives in tool/sample_data and should encompass every type of file that the code works on (if you think this is excessive, try debugging in 6 months when the site has changed and you don't know how it used to look...).
Tests live in tool/tests.py and can be invoked from the tool/ directory by running nosetests.
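For illustration, a test along these lines exercises the scraper against a stored sample page; the sample file name, the expected values and the process(f) convention (described further down) are assumptions, not part of the skeleton:

```python
# Hypothetical contents of tool/tests.py; names and values are illustrative only.
import os

from nose.tools import assert_equal

import main  # tests.py sits next to main.py, so a plain import works under nosetests

SAMPLE_DIR = os.path.join(os.path.dirname(__file__), 'sample_data')


def test_process_listing_page():
    # Feed a stored sample page to the scraper instead of hitting the live site.
    with open(os.path.join(SAMPLE_DIR, 'listing.html')) as f:
        rows = list(main.process(f))
    assert_equal(rows[0]['title'], 'Example item')
```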
Requirements live in requirements.txt and are automatically installed into the virtualenv when you use tool/first_run.sh.
In production, you should use ~/run.sh to actually invoke the code. It's how cron runs the code, and it deliberately takes no arguments so that you can easily reproduce what cron does.
Be aware that run.sh redirects output to the log/ directory (see ~/log/latest) and suppresses stdout unless there's an error, i.e. your program exits with a nonzero exit code.
You should try to stick to the convention of using a process(f) function which takes a file-like object (i.e. a web page) and yields OrderedDicts representing rows to save to the database.
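A minimal sketch of that convention, assuming lxml is in requirements.txt and inventing a page structure purely for illustration:

```python
from collections import OrderedDict

import lxml.html  # assumption: lxml is listed in requirements.txt


def process(f):
    """Take a file-like object containing a web page and yield one
    OrderedDict per row to be saved to the database."""
    root = lxml.html.parse(f).getroot()
    for item in root.xpath('//div[@class="item"]'):  # hypothetical page structure
        yield OrderedDict([
            ('title', item.findtext('h2').strip()),
            ('url', item.find('a').get('href')),
        ])
```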
Using file-like objects allows us to access process(f) from both main.py and tests.py without having to load huge files into memory.
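In main.py, the same function can then be pointed at a live download. The URL is made up and the save call assumes the scraperwiki library, so treat this as a sketch rather than the skeleton's actual wiring:

```python
import logging
import urllib2

import scraperwiki  # assumption: the scraperwiki library is in requirements.txt


def main():
    logging.info("Downloading listing page")
    f = urllib2.urlopen('http://example.com/listing')  # hypothetical URL
    for row in process(f):  # process() defined above, in the same file
        logging.debug("Saving row: %r", row)
        scraperwiki.sqlite.save(unique_keys=['url'], data=row)


if __name__ == '__main__':
    main()
```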
You should favour logging.info(..) over print(..), as we use the logging basic configuration by default. Use at least the info and debug log levels appropriately throughout the code.
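For instance (the basicConfig call is only needed when running a snippet outside the skeleton's default setup):

```python
import logging

# The skeleton is described as using logging's basic configuration by default;
# configuring it here is only necessary when running this snippet on its own.
logging.basicConfig(level=logging.DEBUG)

logging.info("Saved 42 rows to the database")             # headline progress
logging.debug("First row was: %r", {'title': 'Example'})  # detail for debugging
```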
Everything within this template repo is licensed according to the top-level LICENCE file. Any new scraper generated from this template can have its own licence. By default this is the Sensible Code proprietary licence, which lives within the skel/ directory. The LICENCE file inside skel/ is only a suggestion for scrapers built from the template; it has no effect on this template repo.