Python scripts for Retrosheet data downloading and parsing.
- Chadwick 0.6.2: http://chadwick.sourceforge.net/
- python 2.5+ (don't know about 3.0, sorry)
- sqlalchemy: http://www.sqlalchemy.org/
- [if using postgres] psycopg2 python package (dependency for sqlalchemy)
cp scripts/config.ini.dist scripts/config.ini
Edit scripts/config.ini as needed. See the steps below for what might need to be changed.
python download.py [-y <4-digit-year> | --year <4-digit-year>]
The scripts/download.py script downloads Retrosheet data. Edit the config.ini file to configure which types of files are downloaded. Optionally, set the year to download via the command-line argument.
- download > dl_eventfiles determines whether Retrosheet Event Files are downloaded. These are the only files that can be processed by parse.py at this time.
- download > dl_gamelogs determines whether Retrosheet Game Logs are downloaded. These cannot be processed by parse.py at this time.
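As a rough illustration of the two options above, the sketch below reads a download section with Python's standard configparser module. The section and option names come from this README; the True/False values and the inline sample text are assumptions, so check scripts/config.ini.dist for the real format. It is written for Python 3 and is not taken from download.py.

```python
import configparser

# Hypothetical sample text; the real settings live in scripts/config.ini.
SAMPLE = """
[download]
dl_eventfiles = True
dl_gamelogs = False
"""


def enabled_downloads(text):
    """Return the names of the download options that are switched on."""
    config = configparser.ConfigParser()
    config.read_string(text)
    options = ("dl_eventfiles", "dl_gamelogs")
    # A missing section or option is treated as "off" via the fallback.
    return [name for name in options
            if config.getboolean("download", name, fallback=False)]


print(enabled_downloads(SAMPLE))
```

With the sample above, only dl_eventfiles is reported as enabled, which matches the one file type that parse.py can currently process.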
python parse.py [-y <4-digit-year>]
After the files have been downloaded, parse them into SQL with parse.py.
- Create a database called retrosheet (or whatever).
- Add the schema to the database with the included SQL script (the .postgres.sql one works nicely with PG, the other with MySQL).
- Configure config.ini with the appropriate ENGINE, USER, HOST, PASSWORD, and DATABASE values, plus the download directory. If you're using postgres, you can optionally define SCHEMA.
  - Valid values for ENGINE are valid sqlalchemy engines, e.g. 'mysql', 'postgresql', or 'sqlite'.
  - If you have your server configured to allow passwordless connections, you don't need to define USER and PASSWORD.
  - If you are using sqlite3, database in the config should be the path to your database file.
  - Specify the directory that Retrosheet files are downloaded to. It needs to exist before the script runs.
- Run parse.py to parse the files and insert the data into the database (optionally use -y YYYY to import just one year).
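To make the database settings above more concrete, here is a minimal sketch of how those values could map onto an SQLAlchemy-style connection URL of the form engine://user:password@host/database. The function build_db_url is hypothetical and is not necessarily how parse.py builds its connection. It skips URL-escaping of special characters in passwords.

```python
def build_db_url(engine, database, user=None, password=None, host=None):
    """Assemble an SQLAlchemy-style URL from config-like values (illustrative)."""
    if engine == "sqlite":
        # For sqlite, the database value is a file path; no user or host applies.
        return "sqlite:///" + database
    credentials = ""
    if user:
        # USER and PASSWORD may be omitted for passwordless connections.
        credentials = user
        if password:
            credentials += ":" + password
        credentials += "@"
    return "{0}://{1}{2}/{3}".format(engine, credentials, host or "", database)


print(build_db_url("postgresql", "retrosheet", host="localhost"))
print(build_db_url("mysql", "retrosheet", "someuser", "secret", "localhost"))
print(build_db_url("sqlite", "retrosheet.db"))
```

A string like this is what sqlalchemy.create_engine accepts as its first argument.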
Instead of editing the config.ini file, you may optionally use environment variables to set configuration options. Name the environment variables in the format <SECTION>_<OPTION>. Thus, an environment variable that sets the database username would be called DATABASE_USER. Environment variables override any settings in the config.ini file.
Example:
$ DATABASE_DATABASE=rtrsht_testing CHADWICK_DIRECTORY=/usr/bin/ python parse.py -y 1956
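The override rule can be sketched in a few lines of Python 3. This is only an illustration of the <SECTION>_<OPTION> naming scheme, not the code the scripts actually use. The helper get_option, the lowercase section name, and the sample values are assumptions.

```python
import configparser


def get_option(config, section, option, environ):
    """Return <SECTION>_<OPTION> from environ if set, else the config value."""
    env_name = "{0}_{1}".format(section, option).upper()
    if env_name in environ:
        return environ[env_name]
    return config.get(section, option)


# Hypothetical config contents standing in for scripts/config.ini.
config = configparser.ConfigParser()
config.read_string("[database]\nuser = file_user\ndatabase = retrosheet\n")

# Simulated environment; a real script would pass os.environ instead.
fake_environ = {"DATABASE_DATABASE": "rtrsht_testing"}

print(get_option(config, "database", "database", fake_environ))  # from the environment
print(get_option(config, "database", "user", fake_environ))      # from the config
```

Passing the environment in as a plain mapping keeps the lookup easy to test without touching the real process environment.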
GitHub user jeffcrow made many fixes and additions and added sqlite support.
If you're using PostgreSQL (and you should be), you can get a dump of all data up through 2016 (warning: 521MB) here