This project facilitates the use of on-time and delayed flight data, available from the US Department of Transport's Bureau of Transportation Statistics, for DBMS testing and benchmarking.
Specifically, it comprises:
- A script for automating the tasks of downloading the zipped tabular data (available through here), mildly cleaning it, and loading it into a database with a (mostly) proper schema.
- A set of queries are used as the benchmark, to be executed on all data for years 2000-2008, with correct query result files included as well, as reference.
- Miscellaneous additional potentially useful scripts and SQL queries.
This work is inspired (or rather, necessitated) by a post from several years ago on Percona's performance blog, comparing MonetDB with InfoBright on the set of queries included here. They are perhaps not the most thorough or insightful benchmark queries one could device for this data, but that's what we're starting with.
Currently, only MonetDB is supported as the DBMS into which data is to be loaded.
- Internet connection (specifically HTTP)
- The Bourne Again Shell - bash
- various typical Unix-ish command-line tools: unzip, wget, sed, echo and so on.
- MonetDB installed and running
- Enough disk space for the data you want
- Set up a MonetDB 'Database Farm' (see the MonetDB tutorial if you're not sure how to do that).
- Make sure you have your MonetDB credentials (username and password) in
~/.monetdb
(otherwise you will be prompted for credentials for each of dozens of files when loading the date). - Invoke
scripts/setup-usdt-ontime-db
to create and populate DB with data from 2000 through 2008; the script's command-line options are as follows:
Options:
-r, --recreate If the benchmark database exists, recreate it, dropping all
existing data. (If neither recreate nor append are set, the
database be missing.)
-a, --append If the benchmark database exists, append to it rather than
recreating it. (If neither recreate nor append are set, the
database be missing.)
-d, --db-name NAME Name of the database holding on-time performance data
-f, --db-farm PATH Filesystem path for the root directory of the MonetDB
database farm in which to place the database
-p, --port NUMBER IP port on which to connect to the database
-k, --keep-downloads Keep the zipped CSV files downloaded from the US department
of transport's website after loading them into the DB
-s, --fix-sort-order The order of appearance of records in the CSV files is not-uniform;
With this option (on by default) the CSVs will be resorted,
all by the same fixed order.
-i, --ignore-sort-order The order of appearance of records in the CSV files is not-uniform;
With this option, ) the CSVs will be loaded as-is
-v, --verbose Be verbose about actions taken and current status
-h, --help Print usage information
-D, --download-dir Directory into which individual monthly data file will be
downloaded and decompressed before being loaded into the DB
(default: /export/scratch1/home/eyalroz/src/usdt-ontime-tools/usdt-ontime-downloads)
--first-year YEAR First year for which to download data
--first-month MONTH First month in first year for which to download data
--last-year YEAR Last year for which to download data
--last-month MONTH Last month in last year for which to download data
- Execute
scripts/run_benchmark_queries.sh -v
as a sanity check, to make sure you get results that look like the expected answer (you can also diff-compare the results you get withscripts/run_benchmark_queries.sh -w
to the reference results inexpected_results/
).
Feel free to open an issue or write me.