Citus TPC-H tests

The idea is to compare different PostgreSQL Cloud-based offerings. This repository contains a partial implementation of TPC-H with Direct Loading support for PostgreSQL and some scripting to orchestrate highly concurrent tests.

Running the benchmarks

To run the Citus TPC-H benchmark, follow those steps:

Review the target infrastructure setup: infra.ini
Review the benchmark schedule and jobs: tpch.ini
Manually start a Citus cluster, and register its DSN in infra.ini
Give a name to your test setup: ./control.py setup infra.ini tpch.ini

That way, it's possible to run several benchmarks in parallel, with maybe very different infrastructure properties (db.r4.xlarge and db.t4.16xlarge comes to mind) and scale factors etc.

After all we're using Cloud providers, where the hardware is unlimited. We can run all the tests in parallel, in some level of isolation.

This command returns a name, keep that somewhere, you'll need it. Every becnhmark you want to run is given a name.
Choose what systems you want to compare/exercise in this benchmark
```
./control.py register <schedule> citus rds aurora
```
Run the tests
```
./control.py benchmark <name>
```

The tests are now running. It's possible to see what's happening, of course. Try the following commands:

./control.py tail <name>
./control.py status <name>
./control.py update <name>

For those commands to work you need to register a PostgreSQL database that is local to your controler node (usually your laptop):

createdb tpch-results

Then edit tpch.ini, the [results] section should contain a dsn entry where you can put the connection URI postgresql://tpch-results.

It's possible to list currently known benchmark runs:

$ ./control.py list
enticingly_detect currently has 3 systems registered, 3 running
    aurora: stage 13/vacuum analyze in a day with 184 queries 
     citus: stage 16/vacuum analyze in a day with 1120 queries 
       rds: stage 15/vacuum analyze in a day with 116 queries 

engage_busily is not currently running

promise_never currently has 2 systems registered, 2 running
    aurora: stage 11/vacuum analyze in 9 hours with 36 queries 
     citus: stage 8/vacuum analyze in 2 hours with 112 queries

Running the benchmarks locally

It's possible to use the TPC-H benchmarks driver with a local PostgreSQL database, without all the Cloud infrastructure in place. To that end, just use the tpch.py command directly, without using the Makefile based infrastructure management.

Here's an example of doing that:

$ DSN=postgresql:///tpch ./tpch.py benchmark pgsql --schedule quick
2018-02-12 11:20:22,544 INFO pgsql: starting benchmark sleep_slowly
2018-02-12 11:20:22,554 INFO pgsql: starting schedule initdb
2018-02-12 11:20:22,554 INFO pgsql: initializing the TPC-H schema
2018-02-12 11:20:22,554 INFO pgsql: create initial schema, pgsql variant
psql:./schema/cardinalities.sql:1: NOTICE:  view "cardinalities" does not exist, skipping
2018-02-12 11:20:22,707 INFO pgsql: loading 1 steps of data using 16 CPU: [1]
2018-02-12 11:20:28,836 INFO pgsql: loaded step 1/100 for Scale Factor 1
2018-02-12 11:20:28,838 INFO pgsql: vacuum analyze
2018-02-12 11:20:30,758 INFO pgsql: loaded 1 steps of data in 6.12964s, using 16 CPU
2018-02-12 11:20:30,759 INFO pgsql: install constraints from 'schema/tpch-pkeys.sql'
2018-02-12 11:20:30,912 INFO pgsql: install constraints from 'schema/tpch-index.sql'
2018-02-12 11:20:31,253 INFO pgsql: install constraints from 'schema/tpch-fkeys.sql'
2018-02-12 11:20:31,379 INFO pgsql: imported 1 initial steps in 8.82457s, using 16 CPU
2018-02-12 11:20:31,379 INFO pgsql: starting schedule stream
2018-02-12 11:20:31,379 INFO pgsql: Running TPCH with 16 CPUs for 5s, stream 1 4 6 12
2018-02-12 11:20:32,518 INFO pgsql: 1 query streams executed in 1.13002s
2018-02-12 11:20:34,036 INFO pgsql: 17 query streams executed in 2.64805s
2018-02-12 11:20:35,038 INFO pgsql: 17 query streams executed in 3.64983s
2018-02-12 11:20:36,575 INFO pgsql: 33 query streams executed in 5.1877s
2018-02-12 11:20:38,393 INFO pgsql: executed 49 streams (196 queries) in 6.99254s, using 16 CPU

To be able to reproduce this locally, you need to firts create a tpch-results database and configure it in the tpch.ini file.

Analysing the Results

The tpch.py benchmark driver collects timings as the benchmark runs in a local PostgreSQL database named tpch-results. This database is local to each loader instance, so an extra step is needed to collect the results centrally.

Here's the results of running a series of local benchmark as done previously, where the name of the run as been assigned as sleep_slowly:

tpch-results# select system, schedule, job, duration, count
                from results
               where run = 'sleep_slowly';

 system │ schedule │          job          │    duration     │ count 
════════╪══════════╪═══════════════════════╪═════════════════╪═══════
 pgsql  │ quick    │ drop tables           │ @ 0.060019 secs │     1
 pgsql  │ quick    │ create tables         │ @ 0.042395 secs │     1
 pgsql  │ quick    │ initdb                │ @ 6.12964 secs  │     1
 pgsql  │ quick    │ schema/tpch-pkeys.sql │ @ 0.144589 secs │     1
 pgsql  │ quick    │ schema/tpch-index.sql │ @ 0.331428 secs │     1
 pgsql  │ quick    │ schema/tpch-fkeys.sql │ @ 0.11635 secs  │     1
 pgsql  │ quick    │ stream                │ @ 6.992539 secs │   196
(7 rows)

The tpch.py driver is meant to be called on a remote machine, the loader node, and the DSN is then passed in the environment by the main Makefile.

Benchmark Schedule and Jobs

This benchmark is based on TPC-H and meant to incrementally reach the Scale Factor, by implementing the data load in multiple phases. It is possible to configure several load phases in the file tpch.ini, following this example:

[scale]
cpu = 16
factor = 300
children = 100

[schedule]
full     = initdb, stream, phase1, stream
quick    = initdb, stream
stream   = stream

[initdb]
type  = load
steps = 1..10

[phase1]
type  = load
steps = 11..30

[phase2]
type  = load
steps = 31..100
cpu   = 2

[stream]
type     = stream
queries  = 1 4 6 12
duration = 5

With such a setup, the target database size is 300 GB (the TPC-H scale factor unit is roughly 1GB), and we can load the data using the three phases arbitrarily named initdb, phase1 and phase2.

To run the full schedule on all systems in parallel, it's possible to run the following command:

make -j 4 SCHEDULE=full make benchmark

This command then connects to each of the controller nodes for the tested systems, and runs the following command there:

DSN=... ./tpch.py benchmark <system name> --name <name> --schedule full

The system name is going to be replaced by each of the systems considered in this benchmark, currently that's _citus, pgsql, rds, and aurora. The name of the benchmark is computed once on the controller node (the one where you're tying the make commands) then the same name is used on every node. That allows to then easily merge all the tracked timings to further analyze them as part of the same benchmark run and configuration.

Note: the initdb job is an hard-coded special job name that does several extra things.

Initialiazing the TPC-H database

The initdb job consists of the following actions:

create the schema
install the cardinalities view, a simple COUNT wrapper
load the configured steps of data, see next section
install the SQL constraints: primary and foreign keys, and indexes
vacuum analyze the resulting database

It is possible to select a schema variant thanks to the --kind option on the tpch.py load command, currently the pgsql (default) and citus ones are provided.

Loading Phase Specifications

The tpch.ini file expects a Load Phase Specification for each option in the [load] section. Make sure that the section contains the initdb option, as seen above.

Then, each option is an arbitrary name of a loading phase. Each phase consists of a set of steps as per the TPC-H specifications. The steps must be a continuous range.

[phase1]
type  = load
steps = 11..30

In this exemple, the phase1 phase consists of the 20 steps from 11 to 30.

The STEP numbers are used as the -S argument to the dbgen program. Of course, for such a setup to make any sense the steps should all be within the range 1..children, with children being an option of the [scale] section in the same tpch.ini.

Concurrency

The load phases and the streams are done concurrently with a Python pool of processes. The [scale] option cpu is used to configure how many process are being started on the coordinator.

With the previous setup where cpu = 16, the phase1 load phase of steps 11 to 30 included in going to be ran on the pool of 16 worker process, one per CPUs. As soon as a worker process is done with a step, the driver starts another process load one of the remaining steps, until all the steps are loaded.

Streaming Queries Concurrently

The stream testing is limited in time, and we measure how much work could be done in a specified amount of time by the different systems in competition.

[stream]
type     = stream
queries  = 1 4 6 12
duration = 600

In this setup, a STREAM consists of running the TPC-H queries 1, then 4, then 6, then 12, in this order, one after the other. As many streams as we have CPU in the [scale] section are started concurrently, and as soon as a stream is done, it is replaced by another one.

Each query execution time is registered, and new streams are started for as long as duration allows. The duration is read as a number of seconds. After having started new streams during duration seconds, then the process pool waits until the currently running streams are all done.

Here's a sample output of a query stream ran for 5s on a single CPU:

$ DSN=postgresql:///tpch ./tpch.py benchmark pgsql --schedule stream
2018-02-12 11:34:36,730 INFO pgsql: starting benchmark solve_sometimes
2018-02-12 11:34:36,739 INFO pgsql: starting schedule stream
2018-02-12 11:34:36,739 INFO pgsql: Running TPCH with 16 CPUs for 5s, stream 1 4 6 12
2018-02-12 11:34:37,830 INFO pgsql: 1 query streams executed in 1.07669s
2018-02-12 11:34:39,396 INFO pgsql: 17 query streams executed in 2.6434s
2018-02-12 11:34:40,400 INFO pgsql: 17 query streams executed in 3.64771s
2018-02-12 11:34:41,901 INFO pgsql: 33 query streams executed in 5.14833s
2018-02-12 11:34:43,581 INFO pgsql: executed 49 streams (196 queries) in 6.81578s, using 16 CPU

Usage

The main Makefile targets are listed with make help. To test several systems in parallel, use e.g. make -j2 stream.

$ make help
TPC-H benchmark for PostgreSQL and Citus Data

Use make to drive the benchmark, with the following targets:

  help           this help message
  infra          create AWS test infrastructure
  terminate      destroy AWS test infrastructure
  drop           drop TPC-H test tables

  benchmark      bench-citus bench-pgsql bench-rds bench-aurora
  bench-citus    run given SCHEDULE on the citus system
  bench-pgsql    run given SCHEDULE on the pgsql system
  bench-rds      run given SCHEDULE on the rds system
  bench-aurora   run given SCHEDULE on the aurora system

  tail-f         see logs from currently running benchmark
  fetch-logs     fetch logs in ./logs/YYYYMMDD_name/system.log
  dump-results   dump results in ./logs/YYYYMMDD_name/system.dump
  merge-results  merge the results into the RESULTS_DSN database

  cardinalities  run SELECT count(*) on all the tables

Benchmark setup and roles

This benchmark uses 3 kinds of services, that needs to be connected either using SSH or the PostgreSQL protocol directly:

a controller node, typically localhost, your usual laptop
several loader nodes
several database clusters or instances

The Controller

That's where you type your commands from. The controller mainly uses the following files:

control.py
infra.ini
infra.py and the infra Python module

The infra.ini setup registers AWS configuration parameters (such as the region, availability zone, VPC, subnets, security groups, keyname) and the setup of each kind of machine that is going to be used.

When running the benchmark, the infrastructure consists of:

a controller node
a loader node per database system being benchmarked
the database services to test

The infra.py file knows how to start EC2 instances, RDS database instances and Aurora PostgreSQL clusters with a single database instance. To test the Citus and PostgreSQL core instances, you need to manage the services manually and then register the Database Source Name (or DSN) used to connect to the running service.

The control.py file is using the infra API directly, so that you don't need to worry about the details in infra.py. In case you have to, though, you can begin with:

$ ./infra.py ec2 list
$ ./infra.py rds list
$ ./infra.py aurora list

The Loader

While the loader is meant to be remotely controlled by the controller, it is also made easy to interact with directly. The main controller Makefile uses targets that ssh into the loader and run command there. A typical action from the controller will use something that looks like the following:

ssh -l ec2-user DSN=postgresql://.../db make -f Makefile.loader target

The loader mainly uses the following files:

Makefile.loader
tpch-pg PostgreSQL port of the TPC-H sources (dbgen and qgen)
tpch.ini
tpch.py and the tpch Python module
schema/tracking.sql and a PostgreSQL database where to register the stats

The main entry point of the loader is the tpch.py command, which implements its action by means of calling into the Makefile.loader file, with arguments made available on the command line. A typical command run from the loader looks like the following:

DSN=postgresql:///tpch ./tpch.py benchmark pgsql --schedule SCHEDULE

The tpch.py tool then reads its tpch.ini configuration file that contains the benchmarking setup and applies it by calling into the Makefile.loader with the right arguments passed in the command line, such as in the following example:

make -f Makefile.loader SF=30000 C=300 S=1 load
make -f Makefile.loader STREAM='1 3 6 12' stream

Such a command is expected to be called from the tpch.py command line, not interactively. The reason why the Python driver exists is so that we can easily run 16 such commands, with S varying from 1 to 16, on 16 different CPU cores concurrently. It's fair to consider the tcpy.py program as a concurrent driver for the benchmark suite.

The DSN environment variable should be set when calling tpch.py. The Python script does nothing with it, but then Makefile.loader is using it to know how to contact the database system being tested. Setting the DSN is the job of the main controller scripts.

The databases

The infra.py script knows how to create and terminate both RDS and Aurora PostgreSQL instances on AWS. The details of the instances are setup using the following INI syntax:

[rds]
name = tpch
size = 4500
iops = 10000
class = db.r4.2xlarge
stype = io1
pgversion = 9.6.5

[aurora]
name = tpch
class = db.r4.2xlarge
stype = io1

It's then possible to retrieve the DSN to be used to connect to a database created thanks to the infra driver's command:

$ ./infra.py rds dsn --json aws.out/db.rds.json

This might result in an error, so in order to wait until the service is available, it's possible to use the wait command:

$ ./infra.py rds dsn --json aws.out/db.rds.json

That's what the main Makefile is using.

Implementation Notes

Infrastructure

We use the infra.py script in this repository to spin-off some AWS instances of EC2 virtual machines and RDS databases.

Getting the numbers

A benchmark consists of the following activities.

First, load the data set:

create the database model
load an initial set of data
add primary keys, foreign keys and extra indexes to the model
vacuum verbose analyze it all

Now that we have a data set, run the queries:

start N+1 concurrent sessions
the first one runs the database update scripts
each other one is doing a stream of TPCH analytical Queries
measure time spent on each query in each session, and to run each stream

Now, enlarge the data set and repeat step 2 before

generate more data with DBGEN

Scale Factors

The interesting test is going to be with a 30TB database size when completely loaded, and with the following steps:

100 GB
300 GB
1 TB
3 TB
10 TB
30 TB

Here's how to use DBGEN to achieve that:

./dbgen -s 30000 -C 300 -S 1 -D -n DSN

This command produces the first 100GB of data for a 30TB Scale Factor of a test.
./dbgen -s 30000 -C 300 -U 1 -D -n DSN

This command produces the set of updates that go with the first 100GB of data.
DSS_QUERY=../queries/ ./qgen -s 100 1

This command produces Q1 with parameters adapted to SF=100, and the SQL text is found on stdout, ready to be sent to PostgreSQL.

Notes

Autora claims the following on their main page, Amazon Aurora Product Details says:

The PostgreSQL-compatible edition of Aurora delivers up to 3X the throughput of standard PostgreSQL running on the same hardware

Some reading and viewing/listening for background information about Aurora:

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide//AuroraPostgreSQL.BestPractices.html

TCP keepalives and Admission Resource Control?
https://www.toadworld.com/platforms/postgres/b/weblog/archive/2017/06/07/a-first-look-at-amazon-aurora-with-postgresql-compatibility-benefits-and-drawbacks-part-v

So the 3X throughput actually seems to be obtained when using Sysbench, a benchmarking suite that has been known to be good for MySQL and put PostgreSQL into bad lights.
https://www.youtube.com/watch?v=xrMbzHdPLKM&feature=youtu.be&t=30m20s

No checkpoints, no WAL, 4-out-of-6 writes on distributed block storage, or something like that.

dimitri/tpch-citus