Swarm64 DA Benchmark Toolkit

Summary

This toolkit provides methods to execute the TPC-H, TPC-DS, and SSB benchmarks on:

  • PostgreSQL
  • EDB Postgres Advanced Server (EPAS)
  • PostgreSQL with Swarm64 DA
  • EPAS with Swarm64 DA

Important note: to guarantee compatibility between S64 DA and the s64da-benchmark-toolkit, check out the Git tag that corresponds to your version of S64 DA. For example, if your version of S64 DA is 5.1.0, clone this repository and run git checkout v5.1.0 in the repository's root folder before proceeding. For S64 DA versions 4.0.0 and below, check out the tag v4.0.0_and_below.
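
For example, the following commands clone the toolkit and switch to the tag for S64 DA 5.1.0; the repository URL is assumed to be the public Swarm64 GitHub repository:

git clone https://github.com/swarm64/s64da-benchmark-toolkit.git
cd s64da-benchmark-toolkit
git checkout v5.1.0    # use the tag that matches your S64 DA version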

Prerequisites

  • Python 3.6 or newer and pip3
  • For TPC-DS only: the Linux package recode
  • The additional Python packages from requirements.txt, installed for Python 3.6 e.g. with: /usr/bin/python3.6 -m pip install -r requirements.txt
  • The psql PostgreSQL client
  • For loading the data, the database must be accessible as the user postgres or enterprisedb without a password (see the quick check below)
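
A quick way to verify passwordless access, assuming a local PostgreSQL server on the default port 5432, is a simple psql connection test; for EPAS, connect as the enterprisedb user on the corresponding port instead:

psql postgresql://postgres@localhost:5432/postgres -c "SELECT version();"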

Creating a Database and Loading Data

Load a database with a dataset. If the database does not exist, it will be created. If it does exist, it will be deleted and recreated.

./prepare_benchmark \
    --dsn postgresql://postgres@localhost/<target-db> \
    --benchmark <tpch|tpcds|ssb|htap> \
    --schema=<schema-to-deploy> \
    --scale-factor=<scale-factor-to-use>

For example, to load the TPC-H dataset into PostgreSQL with Swarm64 DA using the performance schema:

./prepare_benchmark \
    --dsn=postgresql://postgres@localhost:5432/example-database \
    --benchmark=tpch \
    --schema=s64da_performance \
    --scale-factor=1000

Required Parameters

Parameter Description
dsn The full DSN of the DB to connect to. DSN layout:
postgresql://<user>@<host>:<target-port>/<target-db>
The port is optional and the default is 5432. Example with port 5444 and use of EPAS:
--dsn postgresql://enterprisedb@localhost:5444/example-database
benchmark The benchmark to use: tpch, tpcds, ssb, or htap
schema The schema to deploy. Schemas are directories in the benchmarks/<benchmark>/schemas directory. See the table below for the supported schemas.
scale-factor The scale factor to use, such as 10, 100 or 1000.

Schema Parameter Values

Value Description
psql_native the standard PostgreSQL schema
s64da_native as above but with the S64 DA extension with its default feature set enabled
s64da_native_enhanced as above but with some of the S64 DA opt-in features enabled, such as columnstore index
s64da_performance schema that provides the best performance for S64 DA (includes removal of btree indexes, keys, and use of floating point)
*_partitioned_id_hashed like one of the first four schemas, but some tables are partitioned by hash on their main ID column
*_partitioned_date_week like one of the first four schemas, but tables with date columns are partitioned by week
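
For instance, deploying a hash-partitioned variant of the performance schema could look as follows; the schema name and partition count are illustrative, so make sure the corresponding schema directory actually exists under benchmarks/tpch/schemas:

./prepare_benchmark \
    --dsn=postgresql://postgres@localhost:5432/example-database \
    --benchmark=tpch \
    --schema=s64da_performance_partitioned_id_hashed \
    --scale-factor=1000 \
    --num-partitions=32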

Optional Parameters

Parameter Description
chunks Chunk large tables into smaller pieces during ingestion. Default: 10
max-jobs Limit the overall loading parallelism to this amount of jobs. Default: 8
check-diskspace-of-directory If present, a disk space check is performed on the given storage directory prior to ingestion
data-dir The directory holding the data files to ingest from. Default: none
num-partitions The number of partitions for partitioned schemas. Default: none
start-date The data start date for the HTAP benchmark
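
As an illustration, the following invocation loads pre-generated data files with reduced parallelism; the paths and values are examples only:

./prepare_benchmark \
    --dsn=postgresql://postgres@localhost:5432/example-database \
    --benchmark=tpch \
    --schema=s64da_performance \
    --scale-factor=1000 \
    --max-jobs=4 \
    --chunks=20 \
    --data-dir=/mnt/data/tpch-sf1000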

Depending on the scale factor you chose, it might take several hours for the script to finish. After the script creates the database, it loads the data, creates primary keys, foreign keys, and indices. Afterwards, it runs VACUUM and ANALYZE.
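
Once loading finishes, you can sanity-check the result with psql, for example by counting the rows of the TPC-H lineitem table (at scale factor 1000 it holds roughly six billion rows):

psql postgresql://postgres@localhost:5432/example-database \
    -c "SELECT count(*) FROM lineitem;"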

Running a Benchmark

Start a benchmark:

./run_benchmark \
    --dsn postgresql://postgres@localhost/<target-db> \
    [--benchmark] <tpch|tpcds|ssb|htap> \
    <optional benchmark-specific arguments>

This runs the benchmark with the default runtime restriction per query. Some benchmarks support a --timeout parameter to adjust this limit.

Note: The --benchmark parameter has been deprecated and is ignored. The name of the benchmark should directly follow the specification of --dsn.
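
For example, running the TPC-H queries against the database loaded above with a per-query timeout of 30 minutes could look like this; the timeout value is only an example:

./run_benchmark \
    --dsn postgresql://postgres@localhost:5432/example-database \
    tpch \
    --timeout 30min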

Required Parameters

Parameter Description
dsn The full DSN of the DB to connect to. DSN layout:
postgresql://<user>@<host>:<target-port>/<target-db>
The port is optional and the default is 5432. Example with port 5444 and use of EPAS:
--dsn postgresql://enterprisedb@localhost:5444/example-database
benchmark Name of the benchmark to run, given as a positional argument directly after --dsn: tpch, tpcds, ssb, or htap

Note: if you enable correctness checks with the --check-correctness flag, the parameter --scale-factor is required.

Optional Parameters

Parameter Description
use-server-side-cursors Use server-side cursors for executing the queries.

The optional parameters differ by benchmark. The ones for TPC-H, TPC-DS, and SSB are described in this section. The parameters supported by HTAP are described in a separate section below.

Parameter Description
config Path to additional YAML configuration file
timeout The maximum time a query may run, such as 30min
streams The number of parallel query streams; can be used for throughput tests.
stream-offset The stream to start with when running multiple streams. Default: 1
netdata-output-file File to write Netdata stats to. Requires netdata key to be present in configuration. Default: none
output How the results should be formatted. Multiple options possible. Default: none
csv-file Path to the CSV file for output if csv output is selected. Default: results.csv in the current directory.
check-correctness Compares each query result with pre-recorded results and stores them in the query_results directory. Requires scale-factor to be set.
scale-factor Scale factor for the correctness comparison. Default: none
explain-analyze Whether to run EXPLAIN ANALYZE. Query plans will be saved into the plans directory.
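
A run that writes its results to a CSV file and checks query results for correctness might look like the following; the output format name, file path, and scale factor are examples, and the scale factor must match the one used during preparation:

./run_benchmark \
    --dsn postgresql://postgres@localhost:5432/example-database \
    tpch \
    --output csv \
    --csv-file results/tpch-sf1000.csv \
    --check-correctness \
    --scale-factor 1000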

Test Parameterization with Additional YAML Configuration

You can modify the existing configuration files located under the configs directory. By default, the toolkit loads the respective default.yaml configuration file for each benchmark. Alternatively, you can create an additional configuration file to control test execution more granularly. An example YAML file for the TPC-H benchmark might look as follows:

timeout: 30min
ignore:
  - 18
  - 20
  - 21

dbconfig:
  max_parallel_workers: 96
  max_parallel_workers_per_gather: 32

To use this file, pass the --config=<path-to-file> argument to the test executor. In this example, the query timeout is set to 30min and queries 18, 20, and 21 will not be run. Additionally, the database parameters max_parallel_workers and max_parallel_workers_per_gather will be set to 96 and 32, respectively.

Changing the database configuration requires superuser privileges. Any change to the database configuration is applied to the whole database system before the benchmark starts. After the benchmark completes, the entire database configuration is reset to the values in the PostgreSQL configuration file, so any changes that were applied manually beforehand are reset as well.

Some options can be passed on the command line and in a config file. Any such option passed on the command line will override the value set in the config file.
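
For instance, the following run loads a custom configuration file but overrides its timeout on the command line; the configuration file path is illustrative:

./run_benchmark \
    --dsn postgresql://postgres@localhost:5432/example-database \
    tpch \
    --config=configs/tpch/custom.yaml \
    --timeout 15min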

Note: This feature is not supported by the HTAP benchmark.

HTAP Benchmark

A mixed workload benchmark implementation using a hybrid TPC-C/TPC-H schema is available in benchmarks/htap. It draws inspiration from sysbench-tpcc, CHbenCHmark, and HTAPBench.

Data preparation is identical to the other benchmarks (see "Creating a database and loading data" above).

The HTAP benchmark requires command line arguments that differ from the ones described above. The --dsn argument is shared with the other benchmarks and must be provided. The --benchmark argument is not used; instead, the name htap must be given directly after the --dsn argument. To run an HTAP benchmark with 4 OLTP workers and 2 OLAP workers for 30 minutes, run the following:

./run_benchmark \
    --dsn postgresql://postgres@localhost/htap \
    [--benchmark] htap \
    --oltp-workers 4 \
    --olap-workers 2 \
    --duration 1800

Required Parameters

Parameter Description
dsn The full DSN of the DB to connect to. DSN layout:
postgresql://<user>@<host>:<target-port>/<target-db>
The port is optional and the default is 5432. Example with port 5444 and use of EPAS:
--dsn postgresql://enterprisedb@localhost:5444/example-database
htap The benchmark name, given as a positional argument directly after --dsn; do not prefix it with --. Providing it enables the HTAP-specific arguments below.

Optional Parameters

Parameter Description
oltp-workers The number of OLTP workers executing TPC-C transactions (i.e. simulated clients), default: 1
olap-workers The number of OLAP workers running modified TPC-H queries, default: 1.
duration The number of seconds the benchmark should run for, default: 60 seconds
olap-timeout Timeout for OLAP queries in seconds, default: 900
dry-run Only generate transactions and queries but don't send them to the DB. Can be useful for measuring script throughput.
monitoring-interval Number of seconds to wait between updates of the monitoring display, default: 1
stats-dsn The DSN to use for collecting statistics into a database. Not defining it will disable statistics collection.
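
For example, a longer HTAP run that also collects statistics into a separate database might look like this; the worker counts, durations, and the statistics DSN are illustrative:

./run_benchmark \
    --dsn postgresql://postgres@localhost/htap \
    htap \
    --oltp-workers 8 \
    --olap-workers 4 \
    --duration 3600 \
    --olap-timeout 600 \
    --stats-dsn postgresql://postgres@localhost/htap_stats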

Monitoring

During a benchmark run, the HTAP benchmark presents the following monitoring screen. This requires a VT100-compatible terminal emulator.

Detected scale factor: 1                                 <- scale factor, detected by counting the number of warehouses
Database statistics collection is disabled.              <- this will be shown if you didn't provide a `stats-dsn`
OK  -> Total TX:         87 | Current rate:   58.0 tps   <- the current transaction rate (transactions per second)
ERR -> Total TX:          1 | Current rate:    0.0 tps   <- the current error rate (failed transactions per second)

Stream   |    1      |    2      |                       <- one column per OLAP stream
----------------------------------
Query  1 |           |           |                       <- The state of each query that was
Query  2 |      0.43 |           |                          recently run or is running currently.
Query  3 |           |      0.72 |                          Also shows when a query timed out or
Query  4 |           |           |                          caused an error in the database.
Query  5 |           |           |                          For finished queries the runtime is
Query  6 |      0.07 |           |                          displayed.
Query  7 |           |           |
Query  8 |           |           |
Query  9 |      0.63 |           |
Query 10 |           |           |
Query 11 |           |           |
Query 12 |           |           |
Query 13 |           |           |
Query 14 |      0.25 |           |
Query 15 |           |           |
Query 16 |           |           |
Query 17 |  Running  |           |
Query 18 |           |  Running  |
Query 19 |           |           |
Query 20 |      0.45 |           |
Query 21 |           |      0.74 |
Query 22 |           |           |

Elapsed: 2 seconds

Testing

For testing, install the test requirements,

/usr/bin/python3.6 -m pip install -r requirements-test.txt

and run python -m pytest tests. Some benchmark modules provide their own tests. To run, for example, the tests for the HTAP benchmark, execute python -m pytest benchmarks/htap/tests.