A web crawler. Ergo the name. See? Spidey? Got it? ... Sorry. I'm leaving.
A scraper that gathers information from articles on Gizmodo US, saves them to a database and serves the info through a REST API.
spidey is now up on martinvb.com/spidey.
Running a GET on martinvb.com/spidey yields the following metadata about all scraped posts, serialized as JSON:
title # Post title
author # Post's author
url # Post URL
post_id # Post ID
post_date # Post date
keywords # Post keywords
description # Post short description (single paragraph)
To read the full text of any individual post, GET martinvb.com/spidey/<post_id>.
Load the page's JSON response into a viewer, if you wish.
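Alternatively, both endpoints can be queried from the command line with curl and pretty-printed with Python's built-in json.tool (the post ID 1234 below is just a placeholder):
curl -L http://martinvb.com/spidey | python -m json.tool        # metadata for all scraped posts
curl -L http://martinvb.com/spidey/1234 | python -m json.tool   # full text of post 1234 (placeholder ID)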
Spidey depends on django, django-rest-framework and scrapy. To get them (with pip):
pip install django
pip install djangorestframework
pip install scrapy
Alternatively, you may use the supplied requirements.txt file to install the dependencies, by running:
pip install -r path/to/spidey/requirements.txt
Note that scrapy depends on lxml. pip will run the required compilation/installation, but the compilation may fail on machines with 512MB of RAM (as it did on my VPS). If that's the case, add a swap file as nicely detailed here.
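For reference, a minimal sketch of adding a 1 GB swap file on a Debian-based VPS (the size here is an arbitrary example):
sudo fallocate -l 1G /swapfile   # allocate a 1 GB file
sudo chmod 600 /swapfile         # restrict access to root
sudo mkswap /swapfile            # format it as swap space
sudo swapon /swapfile            # enable it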
After installing the deps, get this repo with:
git clone https://github.com/MartinBloedorn/spidey.git
To ensure that the directory is writable (especially the sqlite database), execute:
sudo chown -R $USER:$USER spidey/
sudo chmod -R 755 spidey/
To get it up and running (with sqlite), execute:
cd spidey
python manage.py makemigrations # creates initialization files for the db
python manage.py migrate # initializes the tables on the db
python manage.py runserver # fires the app up
This gets the server running on localhost:8000 (by default).
By default, spidey (i.e., django) listens on localhost:8000. If spidey is on a VPS running Apache that hosts example.com, you may want to proxy example.com/spidey to localhost:8000, making spidey's API available on that URL (some useful info is available here).
To do so, edit the example.com configuration file (may be example.com.conf, or by default 000-default.conf), usually under /etc/apache2/sites-enabled. In the file, add the following lines to the definition of the VirtualHost:
<VirtualHost *:80>
# ...
# Add the lines below:
ProxyPass /spidey http://127.0.0.1:8000
ProxyPassReverse /spidey http://127.0.0.1:8000
</VirtualHost>
Make sure the following modules are loaded into Apache (using a2enmod):
mod_proxy
mod_proxy_http
mod_ssl
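On a Debian-based system, for example, they can be enabled and the edited configuration checked with:
sudo a2enmod proxy proxy_http ssl   # enable the required modules
sudo apache2ctl configtest          # verify that the edited configuration parses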
Restart Apache (on Debian-based systems: sudo service apache2 restart). Good to go!
To have the crawler run every second hour, one simple solution is to add it as a cron job.
In the spidey repo, launch_crawler.sh calls the crawler. Mark it as executable with chmod a+x launch_crawler.sh. To add it as a cron job, run:
crontab -e
In the editor, add the following line:
0 */2 * * * /path/to/spidey/launch_crawler.sh
The 0 */2 * * * part sets the script to be executed at minute 0 of every second hour, i.e., once every two hours.
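To double-check the entry and give the script a test run (using the same placeholder path as above):
crontab -l                          # list the installed cron jobs
/path/to/spidey/launch_crawler.sh   # run the crawler once by hand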
Though platforms such as Heroku and DeployBot exist to enable decent automatic deployments, a simple/crippled/fairly-dumb alternative can be scripted directly into Git's hooks.
Let's start by hosting a spidey repo on the target VPS (interesting info on that here). Suppose a user johndoe:
johndoe@vps $ cd ~
johndoe@vps $ git clone --bare https://github.com/MartinBloedorn/spidey.git spidey-bare
The --bare option makes a copy of the repo without a working tree, enabling it to be pushed to. To easily access the HEAD version of spidey, however, let's add a regular local repository tracking this bare repo:
johndoe@vps $ git clone spidey-bare spidey # clones a regular repo off of spidey-bare
Every time spidey-bare receives a push, it may run various hooks (callbacks, if you will). We're interested in the post-update hook. To make git run it, execute:
mv spidey-bare/hooks/post-update.sample spidey-bare/hooks/post-update
chmod a+x spidey-bare/hooks/post-update
With your editor of choice, edit spidey-bare/hooks/post-update to look like:
#!/bin/sh
# Inside a hook, GIT_DIR points at the bare repo ('.'); temporarily point it at the regular clone's '.git' instead
GIT_DIR='.git'
# Store the current working directory, to return to it
CWD="$(pwd)"
# Fuser kills the process on port 8000; in this case, python/django
fuser -k 8000/tcp
# Go to the local spidey repo (not spidey-bare)
cd /home/johndoe/spidey
# Get the changes pushed to the bare repo, overwrite local changes
git fetch origin
git reset --hard origin/master
# Setup and run the server
cd spidey
python manage.py makemigrations
python manage.py migrate
python manage.py runserver > /dev/null 2>&1 &   # discard output, or the open pipe can make the push hang
# Undo changes
cd "$CWD"
GIT_DIR='.'
# Update auxiliary info files for dumb transports (this is what the stock post-update sample does)
exec git update-server-info
# Exit script
exit 0
Needless to say, this script is dumb and about as robust as a house of cards on a plane's wing. It's just a bare-minimum starting point for a simple automatic deployment for spidey. Logging and error checking are welcome.
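For instance, one simple way to add logging (the log path is just a hypothetical example) is to redirect all of the hook's output near the top of the script:
# Right after the shebang in post-update:
exec >> /home/johndoe/spidey-deploy.log 2>&1   # append everything the hook prints to a log file
echo "Deploy started: $(date)"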
Lastly, in the repository on their local machine, johndoe needs to add the path to spidey-bare as a remote. Suppose the VPS hosts example.com:
johndoe@local $ git remote add deploy johndoe@example.com:spidey-bare
Now, local modifications can be pushed to the server and automatically deployed (theoretically) with:
johndoe@local $ git push deploy
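Since the hook resets the working copy to origin/master, make sure it is the master branch that gets pushed; a typical cycle looks like this (the commit message is just an example):
johndoe@local $ git commit -am "Tweak the spider"
johndoe@local $ git push deploy master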
The REST portion of this tool is implemented using the django-rest-framework, which sits on top of the django framework.
The scraper/crawler is implemented using the scrapy framework.