Star Schema Benchmark tools (for MonetDB)

This project facilitates work with the Star Schema Benchmark (SSB) on a DBMS (currently, MonetDB).

Table of contents
What's actually in here?
About the Star Schema Benchmark
Requirements
Getting started

What's actually in here?

The repository comprises:

A script for generating the SSB data using a generation utility, creating a new database with the SSB schemata, and loading the generated data
The set of benchmark queries (reference result files are still to be added), named both by their X.Y query numbers and by simple sequential numbers in increasing order
A submodule link to the data generation utility, ssb-dbgen; it is fetched together with this repository when you clone with submodules (for example, with git clone --recursive), and you should be able to use it with no further setup (although it does have its own requirements, such as a C compiler)
Miscellaneous additional potentially useful scripts and SQL queries.
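
As an illustration of the two query numbering schemes mentioned above, here is a minimal sketch that prints the sequential number next to the X.Y number of each query. It assumes the standard SSB layout of four query flights with 3, 3, 4 and 3 queries; the function name is made up for this example, and the actual query file names in this repository are not shown here.

```shell
# Print "<sequential number> Q<flight>.<query>" for each SSB query.
# Assumption: flight sizes 3, 3, 4, 3, as in the SSB specification.
ssb_query_numbering() {
    n=0
    for flight_and_size in 1:3 2:3 3:4 4:3; do
        flight=${flight_and_size%%:*}   # part before the colon
        size=${flight_and_size##*:}     # part after the colon
        i=1
        while [ "$i" -le "$size" ]; do
            n=$((n + 1))
            echo "$n Q$flight.$i"
            i=$((i + 1))
        done
    done
}

ssb_query_numbering
```

The output runs from "1 Q1.1" to "13 Q4.3".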

This repository is inspired by similar efforts of mine for TPC-H and for the USDT-Ontime data set and sort-of-a benchmark.

Only MonetDB is supported as a DBMS right now, and I have no immediate plans to add support for another DBMS - but you're welcome to open an issue and ask for one, or better yet - submit a pull request. It's just some bash scripting after all.

About the Star Schema Benchmark

The Star Schema Benchmark is a modification of the TPC-H benchmark, which is the Transaction Processing Performance Council's (older) benchmark for evaluating the performance of Database Management Systems (DBMSes) on analytic queries - that is, queries which do not modify the data.

The TPC-H has various known issues and deficiencies which are beyond the scope of this document. Researchers Patrick O'Neil, Betty O'Neil and Xuedong Chen, from the University of Massachusetts Boston, proposed a modification of the TPC-H benchmark which addresses some of these shortcomings, in several papers, the latest and most relevant being "Star Schema Benchmark, Revision 3", published June 2009. One of the key features of the modification is the conversion of the TPC-H schemata to Star Schemata ("Star Schema" is a misnomer), by some denormalizing as well as dropping some of the data; more details appear below and even more details in the paper itself.

The benchmark was also accompanied by the initial versions of the code in this repository - a modified utility to generate schema data on which to run the benchmark.

For a recent discussion of the benchmark, you may wish to also read A Review of Star Schema Benchmark, by Jimi Sanchez.

Requirements

Internet connection (specifically HTTP)
The Bourne Again Shell - bash
Various typical Unix-ish command-line tools: unzip, wget, sed, echo and so on
MonetDB installed and able to run
Enough disk space for the data you want
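
A quick way to check the list above before starting might look like the following sketch. The helper name missing_tools is made up for this example. The MonetDB binary names checked here (monetdbd, monetdb, mclient) are the ones a usual MonetDB installation provides, but treat them as an assumption about your setup.

```shell
# Print the names of those commands (given as arguments) which are not on the PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    echo "${missing# }"   # strip the leading space
}

# Assumption: these MonetDB binaries are what "installed and able to run" means here.
not_found=$(missing_tools bash unzip wget sed monetdbd monetdb mclient)
if [ -n "$not_found" ]; then
    echo "Missing tools: $not_found"
else
    echo "All required tools found"
fi

# Show free space on the filesystem holding the current directory.
df -k .
```

How much disk space is enough depends on the scale factor you intend to use.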

Getting started

Make sure you have a MonetDB 'Database Farm' set up (see the MonetDB tutorial if you're not sure how to do that)
Invoke scripts/setup-ssb-db to create the database and populate it with generated SSB data; the script's command-line options are as follows:

  Options:
    -r, --recreate              If the SSB database exists, recreate it, dropping
                                all existing data. (If this option is unset, the 
                                database must not already exist)
    -s, --scale-factor FACTOR   The amount of test data to generate, in GB
    -G, --use-generated         Use previously-generated table load files (in the
                                data generation directory) instead of re-generating
                                them using the dbgen utility.
    -g, --dbgen-dir             Look for the SSB data generation utility in the
                                specified directory.
    -l, --log-file FILENAME     Name of the file to log output into
    -d, --db-name NAME          Name of the database holding SSB test data
                                within the DB farm
    -f, --db-farm PATH          Filesystem path for the root directory of the DB farm
                                with the generated DB
    -p, --port NUMBER           Network port on the local host on which the server
                                for the DB farm listens
    -D, --data-gen-dir PATH     Directory in which to generate the SSB table data
    -k, --keep-raw-tables       Keep the raw data generated by the tool outside of
                                the DBMS
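
To make the options above more concrete, here is a sketch of the first two steps. It only prints the setup command rather than running it, so you can inspect it first. The farm path, database name, port and log file name are made-up example values, and the commented-out monetdbd commands are the usual way to create and start a farm; adapt them to your installation.

```shell
# Example values only; none of these are defaults of the script.
db_farm="$HOME/monetdb-farm"
db_name="ssb-sf1"
port=50000
scale_factor=1

# Step 1, only needed if you have no DB farm yet (standard monetdbd usage):
#   monetdbd create "$db_farm"
#   monetdbd set port="$port" "$db_farm"
#   monetdbd start "$db_farm"

# Step 2: print the setup command, using only options from the list above.
print_setup_command() {
    echo scripts/setup-ssb-db --recreate \
        --scale-factor "$scale_factor" \
        --db-name "$db_name" \
        --db-farm "$db_farm" \
        --port "$port" \
        --log-file setup-ssb-db.log
}

print_setup_command
```

Once the printed command looks right, run it from the repository root. Drop --recreate if the database does not exist yet and you want the script to fail rather than overwrite one that does.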

(May not yet be supported) Execute scripts/run_benchmark_queries.sh -v as a sanity check, to make sure you get results that look like the expected answer (you can also diff-compare the results you get with scripts/run_benchmark_queries.sh -w to the reference results in expected_results/).
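
The diff-comparison mentioned above could be sketched as follows. The helper name compare_results is made up for this example, and since it is not stated here where run_benchmark_queries.sh -w writes its output, the directory holding your results is left as an argument.

```shell
# Recursively compare a directory of obtained results with a directory of
# reference results; succeeds (and says so) only if there are no differences.
compare_results() {
    actual_dir=$1
    expected_dir=$2
    diff -r "$actual_dir" "$expected_dir" && echo "Results match"
}

# Hypothetical usage, after writing results with scripts/run_benchmark_queries.sh -w:
#   compare_results path/to/your/results expected_results
```

Any difference is printed by diff, and the function returns a non-zero status, which makes it easy to use in a script.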

Questions? Requests? Feedback? Bugs?

Feel free to open an issue or contact me.
