
tpcds-dbt-duckdb

I created this repository for a blog post I wrote explaining why dbt and DuckDB are a good alternative to Spark. The blog post can be found here. This repository contains the TPC-DS queries inside a standard dbt project, which uses the dbt-duckdb adapter.

Prerequisites

Data

The data is generated using the Databricks tpcds-kit together with the Databricks spark-sql-perf library. The resulting jars are added to a Spark Docker container following the instructions provided in eks-spark-benchmark; the full setup can be seen in data/Dockerfile.

Generate data locally

We use dsdgen from the Databricks tpcds-kit to generate the data. An example of how to use the resulting Docker image:

docker build -f data/Dockerfile -t tpcds-benchmark .

# Application arguments of the DataGeneration class, in order: data location, path to the
# tpcds-kit tools, data format, scale factor, number of partitions, create partitioned fact
# tables, shuffle to get partitions into single files, set logging to WARN
docker run -v /tmp/tpcds:/var/data -it tpcds-benchmark /opt/spark/bin/spark-submit --master "local[*]" --name somename \
       --deploy-mode client --class com.amazonaws.eks.tpcds.DataGeneration \
       local:///opt/spark/work-dir/eks-spark-benchmark-assembly-1.0.jar \
       /var/data /opt/tpcds-kit/tools parquet 1 10 false false true

The previous command generates all input data as Parquet files with a scale factor of 1 and 10 partitions (for the benchmark we used 100 for both values). If you want to generate more data, change the corresponding parameters. The data is written to /var/data inside the Docker container, which is mounted under /tmp/tpcds on the host.
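
As a quick sanity check you can query the generated Parquet files directly with DuckDB. A minimal sketch, assuming the DuckDB CLI is installed and that the generator wrote the store_sales table under /tmp/tpcds/store_sales (the exact directory layout may differ):

# Hypothetical sanity check: count the rows of one generated table
duckdb -c "SELECT count(*) FROM read_parquet('/tmp/tpcds/store_sales/*.parquet');"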

Generate data on EKS

The same Spark container can be used when generating data on EKS. If you add an IAM role to the pod, you can write the data directly to an S3 path.
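
On EKS the usual mechanism for this is IAM Roles for Service Accounts (IRSA): you annotate the service account the Spark pod runs under with the role ARN. A minimal sketch, assuming a service account named spark and a placeholder role ARN:

# Attach a placeholder IAM role to the pod's service account via IRSA
kubectl annotate serviceaccount spark --namespace default \
    eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/tpcds-data-writer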

Execute the code

In order to execute the code locally, you need to install dbt together with the dbt-duckdb adapter.
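
A minimal sketch of such a local setup; the profile below is a hypothetical example for the dbt-duckdb adapter (the profile name has to match the one referenced in dbt_project.yml, and the database path is a placeholder):

pip install dbt-duckdb     # installs dbt-core together with the DuckDB adapter

# Hypothetical minimal profile for the dbt-duckdb adapter
mkdir -p ~/.dbt
cat > ~/.dbt/profiles.yml <<'EOF'
tpcds:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/tpcds.duckdb   # local DuckDB database file (placeholder)
      threads: 8
EOF

dbt run                    # execute all models in the project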

If you want to run the code remotely, this repository contains a dags folder to execute the code on Conveyor. The code in this repository is packaged in a Docker container starting from a dbt base image. The base image is publicly available and its contents are described here.

If you want to run the benchmark yourself, this is relatively simple. The Airflow DAG code (in the dags folder) can be altered to run on any Airflow installation by replacing the ConveyorContainerOperatorV2 with a KubernetesPodOperator.

Queries

I copied all queries from the DuckDB TPC-DS benchmark and made them work with S3 input/output. The original DuckDB queries can be found here.
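
For context, DuckDB reads Parquet directly from S3 through its httpfs extension, which dbt-duckdb can load via its profile. A sketch of the raw mechanism using the DuckDB CLI, with placeholder bucket and region:

# Read a TPC-DS input table straight from S3 (bucket and region are placeholders)
duckdb -c "INSTALL httpfs; LOAD httpfs; \
           SET s3_region='eu-west-1'; \
           SELECT count(*) FROM read_parquet('s3://my-tpcds-bucket/store_sales/*.parquet');"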

TPC-DS results

We ran the benchmark for all queries on m.2xlarge machines, which have 8 vCPUs and 32 GB of RAM, with 100 GB of disk storage attached. All but 5 queries return successfully; I need to investigate further why these 5 queries go OOM, even on larger instances.

Query Time (s)
q01 9.55
q02 18.12
q03 11.46
q04 83.04
q05 42.64
q06 41.28
q07 21.72
q08 13.75
q09 59.95
q10 20.16
q11 47.93
q12 5.15
q13 29.36
q14 147.68
q15 11.69
q16 27.44
q17 18.74
q18 16.84
q19 OOM
q20 5.05
q21 8.73
q22 26.91
q23 OOM
q24 25.96
q25 21.31
q26 10.65
q27 42.43
q28 45.84
q29 15.92
q30 10.01
q31 31.35
q32 9.58
q33 21.29
q34 7.89
q35 13.10
q36 30.01
q37 12.73
q38 15.84
q39 14.14
q40 8.84
q41 1.16
q42 7.15
q43 8.14
q44 20.78
q45 6.02
q46 13.50
q47 53.80
q48 13.83
q49 35.11
q50 11.07
q51 43.87
q52 7.17
q53 8.68
q54 67.14
q55 6.87
q56 19.42
q57 27.39
q58 23.86
q59 29.74
q60 18.17
q61 22.46
q62 4.42
q63 8.64
q64 OOM
q65 29.27
q66 14.57
q67 521.49
q68 15.25
q69 14.55
q70 20.36
q71 22.41
q72 47.15
q73 7.76
q74 33.63
q75 53.12
q76 15.83
q77 29.11
q78 63.17
q79 13.92
q80 48.35
q81 8.38
q82 14.79
q83 8.15
q84 5.24
q85 12.43
q86 5.28
q87 17.78
q88 31.30
q89 9.75
q90 5.95
q91 5.09
q92 7.26
q93 15.25
q94 13.10
q95 OOM
q96 4.88
q97 20.31
q98 7.73
q99 7.05