This is a simple fantasy football data scraper I put together using scrapy and postgres (via the psycopg2 library). Currently only ESPN fantasy football league (FFL) data is supported, but it could pretty easily be extended to other leagues by defining other spiders.
The scrapy workflow is pretty simple: you define some places to start (`start_urls`) and a way of parsing the html responses you would get if you requested those urls (a function called, oddly enough, `parse`). The module then iteratively loads urls, parses out the score projection information ESPN has for players, and persists that information to a local postgres database.
In my instance, I was interested in urls of the form `http://games.espn.go.com/ffl/tools/projections?&leagueId={}`, formatting the `{}` with my particular fantasy league's id. One caveat: this will only work if you have a leagueId, I think -- it could (and should) be extended to work without one, but that would change the table format of the resulting html, so I'll punt (pun intended) for now. If you pull up a page like that, it will only show the first 40 or 60 elements, but of course I want all of the subsequent table pages (`&startIndex={40,80,120,...}`). To accomplish this, I do two things:
- I define a `parse` function that looks at a given 40-element table page and parses each row into a scrapy `Item` class, and
- at the end of the parsing of that page, I look for a "NEXT" link on that page and submit that as a follow-up scrapy request (sketched below).
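As a rough illustration of those two steps, a spider along these lines might look like the following. This is a sketch, not the actual spider in this repo: the class name, `Item` fields, xpath expressions, and league id are all placeholders.

```python
import scrapy

class PlayerItem(scrapy.Item):
    # hypothetical fields; the real Item class in this repo likely differs
    player = scrapy.Field()
    projection = scrapy.Field()

class EspnProjectionsSpider(scrapy.Spider):
    name = "espn"
    # the {} is formatted with your leagueId; 123456 is a made-up placeholder
    start_urls = ["http://games.espn.go.com/ffl/tools/projections?&leagueId=123456"]

    def parse(self, response):
        # 1) parse each row of the current 40-element table page into an Item
        for row in response.xpath("//table//tr[td]"):
            item = PlayerItem()
            item["player"] = row.xpath("./td[1]//text()").extract_first()
            item["projection"] = row.xpath("./td[last()]//text()").extract_first()
            yield item

        # 2) follow the "NEXT" link, if there is one, as another scrapy request
        next_href = response.xpath("//a[contains(., 'NEXT')]/@href").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
```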
My `parse` function is pretty stupid -- mostly just using `xpath` expressions to jump around in the `response` object and doing some minor string manipulations. Each item is then yielded up to the global scrapy process, which in turn passes that item through whatever pipelines have been defined. Here, I do something more complicated than the basic print-to-screen pipeline -- I persist the details of that item to a table `raw_data` in a local postgres database `ffldata`.
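For a sense of what that pipeline step involves, a minimal psycopg2-backed pipeline looks something like the sketch below. The class name, column names, and hard-coded connection parameters are assumptions for illustration; the actual pipeline reads its connection settings from `settings.py`.

```python
import psycopg2

class PostgresPipeline(object):
    """Illustrative sketch of a scrapy pipeline that writes items to postgres."""

    def open_spider(self, spider):
        # connection parameters are hard-coded here for brevity; the actual
        # module pulls them from its scrapy settings
        self.conn = psycopg2.connect(dbname="ffldata", user="ffldata")
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # hypothetical columns; the real raw_data schema may differ
        self.cur.execute(
            "INSERT INTO raw_data (player, projection, ffl_source) VALUES (%s, %s, %s)",
            (item.get("player"), item.get("projection"), "espn"),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```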
First, of course, you have to clone this bad boy. Next, you have to prepare a python and a postgres environment.
You need to create an environment in which to execute the python code. I'm currently running anaconda, so my requirements.txt file is an anaconda requirements file (that is, you can run
# creating a conda environment
conda create --name <env> --file requirements.txt
to create a suitable environment).
If you're using virtualenvs / pip, I believe requirements.pip.txt will work for you as well (but I won't be keeping that as up to date as the anaconda req file).
# virtualenv / pip environment
virtualenv /path/to/venv
source /path/to/venv/bin/activate
pip install -r requirements.pip.txt
Hopefully you have your own postgres instance which you can access and alter. If you can't, email me and I can help you and whoever your admin is navigate the process of setting it up.
Assuming you have access to the postgres user, you will first need to create the user and database and build the table. This can be done relatively easily using the `bootstrap_postgres.sql` file I've provided:
sudo su - postgres
psql -f /path/to/cloned/repo/bootstrap_postgres.sql
After this, assuming you have root access, you will need to update the host based authentication properties of the `ffldata` user we just created by adding the last line shown here to your `pg_hba.conf` file:

# the following contents are from the file /etc/postgres/#.#/main/pg_hba.conf
# Database administrative login by Unix domain socket
local   all       postgres             peer
local   ffldata   ffldata              md5
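Once that is in place (and postgres has reloaded its configuration), you can sanity-check the new user from python. This is just an illustrative check; the password is whatever `bootstrap_postgres.sql` set for the `ffldata` user.

```python
import psycopg2

# quick connection check against the database and user created above
conn = psycopg2.connect(dbname="ffldata", user="ffldata",
                        host="localhost", password="your-password-here")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM raw_data;")
    print(cur.fetchone())  # (0,) before any scraping has run
conn.close()
```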
With those pieces out of the way, you are free to scrape to your heart's content!
Start the scraper with
scrapy crawl espn [-a KEY=VALUE] [--set KEY=VALUE]
The `-a` flag is for parameters to pass directly to the spider, and there is only one of those:

- `wipeTable`: a boolean, whether or not we should first drop all rows in `raw_data`. Note that running the web scraper a second time without doing this will fail, as all rows have already been added and second attempts will violate the primary key constraint for this table.
The `--set` parameters are for overriding those set in `ffl_data_scraper/settings.py`, and which are used throughout the module (not just the spider). Notable `KEY, VALUE` options set this way are:

- `LEAGUE_ID`: if you have your own league id in ESPN FFL, you can pass it here at the command line and your league's scoring method will be taken into account.
- `PG_USER`, `PG_DBNAME`, `PG_HOST`, and `PG_PASSWORD`: if set, these parameters will be used to make connections to the postgres database we created above.
Log messages will be written out to `/path/to/repo/ffl_data_scraper.log`, and you should be able to verify that the persistence to the postgres database worked by running
# in bash
$ psql -U ffldata
-- in psql
> SELECT count(*) FROM raw_data WHERE ffl_source = 'espn';
and seeing approximately 1660 rows.