This is the decision support (DS) benchmark derived from TPC-DS. This repo automates running the DS benchmark against an existing Greenplum cluster. Supported versions:

- VMware Greenplum 4.3.x, 5.x, 6.x, 7.x
- Open Source Greenplum Database 5.x, 6.x
TPC has published the following TPC-DS standards over time:
TPC-DS Benchmark Version | Published Date | Standard Specification |
---|---|---|
3.2.0 (latest) | 2021/06/15 | http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf |
2.1.0 | 2015/11/12 | http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.1.0.pdf |
1.3.1 (earliest) | 2015/02/19 | http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v1.3.1.pdf |
These are the version combinations of this DS benchmark, TPC-DS, and Greenplum:
DS Benchmark Version | TPC-DS Benchmark Version | Greenplum Version |
---|---|---|
3.0.0 | 2.1.0 | 4.3.x, 5.x, 6.x |
2.2.x | 2.1.0 | 4.3.x, 5.x |
2.x | 2.1.0 | 4.3.x |
- A running Greenplum Database with `${ADMIN_USER}` access (default `gpadmin`)
- The `${PGDATABASE}` database (default `gpadmin`) is created
- `root` access on the master node `mdw` for installing dependencies
- `ssh` connections between `mdw` and the segment nodes `sdw1..n`

A user other than `gpadmin` can also execute this script, provided `ADMIN_USER` has access privileges equivalent to `gpadmin`, e.g.:

- Access privileges on the gpdata directory and its sub-directories on each node
- Read access to the `TPC-DS/log` folder across nodes
- The `ADMIN_USER` database is created
- Proper authentication set up in `pg_hba.conf` (a sketch follows below)
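As a rough illustration (the role name matches the default `BENCH_ROLE` used later in this README, but the network range and auth method are examples only, not values from this repo), `pg_hba.conf` entries for a non-`gpadmin` benchmark user might look like:

```
# Hypothetical pg_hba.conf entries; adjust role, database, address, and method
# to your cluster's security policy.
local   all   dsbench                 trust
host    all   dsbench   10.0.0.0/24   trust
```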
All the following examples use the standard Greenplum host name convention: `mdw` for the master node and `sdw1..n` for the segment nodes.
Install the dependencies on `mdw` for compiling `dsdgen` (data generation) and `dsqgen` (query generation):

```bash
ssh root@mdw
yum -y install gcc make
```
Note: If you are using Photon OS, then you need:

```bash
yum -y install build-essential
```
The original source code is from http://tpc.org/tpc_documents_current_versions/current_specifications5.asp.
Visit the repo at https://github.com/pivotal/TPC-DS/releases and download the tarball to the `mdw` node:

```bash
ssh ${ADMIN_USER}@mdw
curl -LO https://github.com/pivotal/TPC-DS/archive/refs/tags/v3.7.0.tar.gz
tar xzf v3.7.0.tar.gz
mv TPC-DS-3.7.0 TPC-DS
```
NOTE: default `ADMIN_USER` is assumed to be `gpadmin`.
To run the benchmark, log in as `${ADMIN_USER}` on `mdw`:

```bash
ssh ${ADMIN_USER}@mdw
cd ~/TPC-DS
./tpcds.sh
```
NOTE: default `ADMIN_USER` is assumed to be `gpadmin`.
By default, it will run at scale 1 (1GB) with 2 concurrent users, from data generation through score computation. By changing `tpcds_variables.sh`, we can control how the benchmark runs.
This is the default example in `tpcds_variables.sh`:

```bash
# environment options
ADMIN_USER=`whoami`
BENCH_ROLE="dsbench"
PGOPTIONS=""
PGDATABASE=`whoami`

# benchmark options
GEN_DATA_SCALE="1"
MULTI_USER_COUNT="2"
RNGSEED="1"
HEAP_ONLY="false"

# step options
RUN_COMPILE_TPCDS="true"
RUN_GEN_DATA="true"
RUN_INIT="true"
RUN_DDL="true"
RUN_LOAD="true"
RUN_SQL="true"
RUN_SINGLE_USER_REPORTS="true"
RUN_MULTI_USER="true"
RUN_MULTI_USER_REPORTS="true"
RUN_SCORE="true"

# misc options
SINGLE_USER_ITERATIONS="1"
EXPLAIN_ANALYZE="false"
RANDOM_DISTRIBUTION="false"
```
`tpcds.sh` will validate the existence of those variables.
```bash
# environment options
ADMIN_USER="gpadmin"
```
These are the setup-related variables:

- `ADMIN_USER`: default `gpadmin`. It is the default database administrator account, as well as the user with access to all `mdw` and `sdw1..n` machines.

Note: The benchmark-related files for each segment node are located in the segment's `${PGDATA}/dsbenchmark` directory. If there isn't enough space in this directory on each segment, you can create a symbolic link to a drive location that does have enough space. For example:
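Here is a minimal sketch for one segment host, assuming `/data/big_disk` is a larger mount and using `SEG_PGDATA` as an illustrative stand-in for that segment instance's data directory (both names are examples, not values from this repo):

```bash
# Illustrative only: host the benchmark files on a larger volume,
# linked back into the segment data directory.
mkdir -p /data/big_disk/dsbenchmark
ln -s /data/big_disk/dsbenchmark "${SEG_PGDATA}/dsbenchmark"
```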
In most cases, we just leave these variables at their defaults.
```bash
export PGPORT="6543"
# For additional PostgreSQL environment variables, refer to:
# https://www.postgresql.org/docs/current/libpq-envars.html
export PGOPTIONS=""
export PGDATABASE="${ADMIN_USER}"
```
TPC-DS uses the `psql` command, which internally uses `libpq`, to connect to the database. `libpq` provides environment variables that can override the default connection parameter values; they are used by `PQconnectdb`, `PQsetdbLogin`, and `PQsetdb` if no value is specified directly by the calling code.
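For example, a minimal sketch showing that any subsequent `psql` invocation picks these up without extra flags:

```bash
export PGPORT="6543"            # connect to a non-default port
export PGDATABASE="gpadmin"     # default database for psql
psql -c "select version();"     # connects using the variables above
```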
```bash
# benchmark options
GEN_DATA_SCALE="1"
MULTI_USER_COUNT="2"
RNGSEED="1"
HEAP_ONLY="false"
```
These are the benchmark controlling variables:

- `GEN_DATA_SCALE`: default `1`. Scale 1 is 1GB.
- `MULTI_USER_COUNT`: default `2`. Usually referred to as `CU`, i.e. concurrent users. It controls how many concurrent streams run during the throughput run.
- `RNGSEED`: default `1`. Generates deterministic data and queries, reducing variance between runs due to data and query variations. Change this value to generate a different distribution and avoid over-tuning for a specific one.
- `HEAP_ONLY`: default `false`. When `false`, the benchmark runs with the append-optimized table storage type; when `true`, it runs with the heap table storage type (see the sketch below).
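To illustrate the difference (table and column names below are made up, and the exact storage clause the kit emits may differ by Greenplum version):

```sql
-- Heap table, as used when HEAP_ONLY="true":
create table sales_heap (id int, amount numeric) distributed by (id);

-- Append-optimized table, as used when HEAP_ONLY="false":
create table sales_ao (id int, amount numeric)
with (appendonly=true)
distributed by (id);
```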
If evaluating a Greenplum cluster across different platforms, we recommend changing this section to 3TB with 5 concurrent users:

```bash
# benchmark options
MULTI_USER_COUNT="5"
GEN_DATA_SCALE="3000"
```
```bash
# step options
RUN_COMPILE_TPCDS="true"
RUN_GEN_DATA="true"
RUN_INIT="true"
RUN_DDL="true"
RUN_LOAD="true"
RUN_SQL="true"
RUN_SINGLE_USER_REPORTS="true"
RUN_MULTI_USER="true"
RUN_MULTI_USER_REPORTS="true"
RUN_SCORE="true"
```
There are multiple steps in running the benchmark, controlled by these variables (a sketch of a typical rerun configuration follows the list):

- `RUN_COMPILE_TPCDS`: default `true`. Compiles `dsdgen` and `dsqgen`. Usually we only want to compile these binaries once. On a rerun, just set this value to `false`.
- `RUN_GEN_DATA`: default `true`. Uses the `dsdgen` compiled above to generate the flat files for the benchmark. The flat files are generated in parallel on all segment nodes and stored under the `${PGDATA}/dsbenchmark` directory. On a rerun, just set this value to `false`.
- `RUN_INIT`: default `true`. Sets up the GUCs for Greenplum and records the segment configuration. It is only required if the Greenplum cluster is reconfigured, and can always be left `true` to ensure a proper Greenplum cluster configuration. On a rerun, just set this value to `false`.
- `RUN_DDL`: default `true`. Recreates all the schemas and tables (including the external tables for loading). If you want to keep the data and just rerun the queries, set this value to `false`; otherwise all the existing loaded data will be gone.
- `RUN_LOAD`: default `true`. Loads data from the flat files into the tables, then computes statistics. If you just want to rerun the queries, set this value to `false`.
- `RUN_SQL`: default `true`. Runs the power test of the benchmark.
- `RUN_SINGLE_USER_REPORTS`: default `true`. Uploads the results to the Greenplum database `${PGDATABASE}` under the schema `tpcds_reports`. These tables are required later in the `RUN_SCORE` step, so keep this `true` if `RUN_SQL` is `true`.
- `RUN_MULTI_USER`: default `true`. Runs the throughput run of the benchmark. Before running the queries, multiple streams are generated by `dsqgen`, which samples the database to find proper filters. For a very large database and many streams, this process alone can take a long time (hours) just to generate the queries.
- `RUN_MULTI_USER_REPORTS`: default `true`. Uploads the results to the Greenplum database `${PGDATABASE}` under the schema `tpcds_reports`. Keep this `true` if `RUN_MULTI_USER` is `true`.
- `RUN_SCORE`: default `true`. Queries the results from `tpcds_reports` and computes the `QphDS` score based on the supported benchmark standard. Keep this `true` if you want to see the final score of the run.
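For example, a typical rerun against already-generated and already-loaded data repeats only the query phases (a sketch based on the descriptions above):

```bash
# step options for a query-only rerun
RUN_COMPILE_TPCDS="false"
RUN_GEN_DATA="false"
RUN_INIT="false"
RUN_DDL="false"        # keep the existing schemas and loaded data
RUN_LOAD="false"
RUN_SQL="true"
RUN_SINGLE_USER_REPORTS="true"
RUN_MULTI_USER="true"
RUN_MULTI_USER_REPORTS="true"
RUN_SCORE="true"
```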
If any of the above variables is missing or invalid, the script aborts and reports the missing or invalid variable name.
WARNING: TPC-DS no longer relies on the log folder to decide whether to run or skip steps. It only runs the steps that are explicitly set to `true` in `tpcds_variables.sh`. If a necessary step is set to `false` but has never been executed before, the script will abort when it tries to access something that does not yet exist in the database or under the directory.
```bash
# misc options
EXPLAIN_ANALYZE="false"
RANDOM_DISTRIBUTION="false"
SINGLE_USER_ITERATIONS="1"
```
These are miscellaneous controlling variables:

- `EXPLAIN_ANALYZE`: default `false`. If set to `true`, the queries execute with `EXPLAIN ANALYZE` so you can see exactly the query plan used, the cost, the memory used, and so on (see the example after this list). This option is for debugging purposes only, since collecting those query statistics disturbs the benchmark.
- `RANDOM_DISTRIBUTION`: default `false`. If set to `true`, the fact tables are distributed randomly rather than on pre-defined distribution columns.
- `SINGLE_USER_ITERATIONS`: default `1`. Controls how many times the power test runs. During the final score computation, the minimal (fastest) query elapsed time across the runs is used. This can be used to ensure the power test runs in a warm environment.
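For reference, setting `EXPLAIN_ANALYZE="true"` has the effect of running each query roughly like this (illustrative query, not taken from the kit):

```sql
-- Executes the query and reports the actual plan, timings, and memory use.
explain analyze
select count(*) from store_sales;
```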
Table storage is defined in `functions.sh` and is configured for optimal performance. The `get_version()` function defines different storage options for different scales of the benchmark (a hypothetical sketch follows the list):
- `SMALL_STORAGE`: all the dimension tables
- `MEDIUM_STORAGE`: `catalog_returns` and `store_returns`
- `LARGE_STORAGE`: `catalog_sales`, `inventory`, `store_sales`, and `web_sales`
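As a purely hypothetical sketch of what these classes could expand to (the actual clauses live in `get_version()` in `functions.sh` and vary by Greenplum version and scale):

```bash
# Hypothetical values, for illustration only.
SMALL_STORAGE=""                                            # dimension tables: heap
MEDIUM_STORAGE="with (appendonly=true, orientation=column)"
LARGE_STORAGE="with (appendonly=true, orientation=column)"
```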
Example of running the benchmark as `root` as a background process:

```bash
nohup ./tpcds.sh &> tpcds.log &
```
Figure 12 of https://www.tpc.org/tpcds/presentations/the_making_of_tpcds.pdf (a guideline from that presentation, not from the TPC-DS specification itself) indicates the following minimum number of query streams by scale factor:
Scale Factor | Minimum Number of Streams |
---|---|
100 | 3 |
300 | 5 |
1,000 | 7 |
3,000 | 9 |
10,000 | 11 |
30,000 | 13 |
100,000 | 15 |
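For example, to satisfy the table above at scale factor 1,000 (1TB), `tpcds_variables.sh` would need at least:

```bash
GEN_DATA_SCALE="1000"
MULTI_USER_COUNT="7"   # minimum number of streams for scale factor 1,000
```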
The interval arithmetic in the query templates was rewritten to use an explicit interval cast:

Old:

```sql
and (cast('2000-02-28' as date) + 30 days)
```

New:

```sql
and (cast('2000-02-28' as date) + '30 days'::interval)
```

This was done on queries: 5, 12, 16, 20, 21, 32, 37, 40, 77, 80, 82, 92, 94, 95, and 98.
Some queries were wrapped in a derived table so that the outer `order by` can reference the computed columns (changes are marked with `--new`):

Old:

```sql
select
    sum(ss_net_profit) as total_sum
   ,s_state
   ,s_county
   ,grouping(s_state)+grouping(s_county) as lochierarchy
   ,rank() over (
        partition by grouping(s_state)+grouping(s_county),
        case when grouping(s_county) = 0 then s_state end
        order by sum(ss_net_profit) desc) as rank_within_parent
from
    store_sales
   ,date_dim d1
   ,store
where
    d1.d_month_seq between 1212 and 1212+11
    and d1.d_date_sk = ss_sold_date_sk
    and s_store_sk = ss_store_sk
    and s_state in
    ( select s_state
      from (select s_state as s_state,
                   rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
            from store_sales, store, date_dim
            where d_month_seq between 1212 and 1212+11
              and d_date_sk = ss_sold_date_sk
              and s_store_sk = ss_store_sk
            group by s_state
           ) tmp1
      where ranking <= 5
    )
group by rollup(s_state,s_county)
order by
    lochierarchy desc
   ,case when lochierarchy = 0 then s_state end
   ,rank_within_parent
limit 100;
```
New:

```sql
select * from ( --new
select
    sum(ss_net_profit) as total_sum
   ,s_state
   ,s_county
   ,grouping(s_state)+grouping(s_county) as lochierarchy
   ,rank() over (
        partition by grouping(s_state)+grouping(s_county),
        case when grouping(s_county) = 0 then s_state end
        order by sum(ss_net_profit) desc) as rank_within_parent
from
    store_sales
   ,date_dim d1
   ,store
where
    d1.d_month_seq between 1212 and 1212+11
    and d1.d_date_sk = ss_sold_date_sk
    and s_store_sk = ss_store_sk
    and s_state in
    ( select s_state
      from (select s_state as s_state,
                   rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
            from store_sales, store, date_dim
            where d_month_seq between 1212 and 1212+11
              and d_date_sk = ss_sold_date_sk
              and s_store_sk = ss_store_sk
            group by s_state
           ) tmp1
      where ranking <= 5
    )
group by rollup(s_state,s_county)
) AS sub --new
order by
    lochierarchy desc
   ,case when lochierarchy = 0 then s_state end
   ,rank_within_parent
limit 100;
```
This was done on queries: 36 and 70.
In these cases, the common table expression used aliased columns, but the dynamic filters included both the alias name and the original column name. Referencing the original column name instead of the alias causes the query parser to fail to find the column.
This was done on query 86.
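A minimal illustration of the failure mode (a toy CTE, not the actual query 86):

```sql
with tmp as (select s_state as state_alias from store)
select state_alias from tmp;   -- works: the alias is the visible column name
-- select s_state from tmp;    -- fails: the original name is not visible outside the CTE
```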
This was done on queries: 2, 14, and 23.
For the larger tests (e.g. 15TB), a few of the TPC-DS queries can output a very large number of rows which are just discarded.
This was done on queries: 64, 34, and 71.
For larger tests (50TB and up), the data load will fail due to out-of-range numbers.
This was done on the loading of `web_sales` and `web_returns`.
Development tooling used by this repo:

- `shellcheck`: https://github.com/koalaman/shellcheck
- `shfmt`: https://github.com/mvdan/sh

Run the linters with:

```bash
make lint
```