/tpch-spark

TPC-H queries in Apache Spark SQL using native DataFrames API

Primary LanguageCMIT LicenseMIT

tpch-spark

TPC-H queries implemented in Spark using the DataFrames API.

The TPC-H is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. — https://www.tpc.org/tpch


Getting started

Prerequisites: Apache Spark

tpch-spark requires that Spark is installed on your machine. You can download Spark from https://spark.apache.org/downloads.html. At a high level, to install Spark you have to:

# Step 1:
tar xvfz <the tgz file you downloaded>

# Step 2:
# [optionally move/rename the untarred directory wherever you want, say, $HOME/spark]

# Step 3:
export PATH=$PATH:$HOME/spark/bin
# or better yet, add the above to your bashrc (or equivalent) and source it.

A. Get the code

git clone https://github.com/ssavvides/tpch-spark
cd tpch-spark

B. Generate input data tables

Navigate to the data generator directory dbgen and build the data generator:

cd dbgen
make

This should generate an executable called dbgen. Use the -h flag to see the various options the tool offers.

./dbgen -h

The simplest case is running the dbgen executable with no flags.

./dbgen

The above generates tables with extension .tbl with scale 1 (default) for a total of roughly 1GB size across all tables.

$ ls -hl *.tbl
-rw-rw-r-- 1 savvas savvas  24M May 28 12:39 customer.tbl
-rw-rw-r-- 1 savvas savvas 725M May 28 12:39 lineitem.tbl
-rw-rw-r-- 1 savvas savvas 2.2K May 28 12:39 nation.tbl
-rw-rw-r-- 1 savvas savvas 164M May 28 12:39 orders.tbl
-rw-rw-r-- 1 savvas savvas 114M May 28 12:39 partsupp.tbl
-rw-rw-r-- 1 savvas savvas  24M May 28 12:39 part.tbl
-rw-rw-r-- 1 savvas savvas  389 May 28 12:39 region.tbl
-rw-rw-r-- 1 savvas savvas 1.4M May 28 12:39 supplier.tbl

For different size tables you can use the -s (scale) option. For example,

./dbgen -s 10

will generate roughly 10GB of input data.

Note that by default, dbgen uses a | as a column separator, and includes a | at the end of each entry.

$ cat region.tbl 
0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|
2|ASIA|ges. thinly even pinto beans ca|
3|EUROPE|ly final courts cajole furiously final excuse|
4|MIDDLE EAST|uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl|

You can find the schemas of the generated tables in the TPC-H specification

C. Build tpch-spark

tpch-spark is written in Scala as a self-contained Spark application.

Use the provided sbt file to build tpch-spark as a spark application.

cd tpch-spark
sbt package

The above command will package the application into a jar file, e.g., ./target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar which you will be needing in the next step.

D. Run tpch-spark

You can run all TPC-H queries from Q01 to Q22 by running:

spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar

If you want to run a specific query you can use

spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar <query number>

where <query number> is the number of the query to run, i.e., 1, 2, ..., 22.

N.B.:

  • By default, tpch-spark will look for the input data files (the *.tbl files generated by dbgen) in "<current working directory>/dbgen". You can point to another location by setting the environment variable TPCH_INPUT_DATA_DIR.
  • By default, the query results will be stored in "${TPCH_INPUT_DATA_DIR}/output/{Q01, Q02, ...}, or to whatever location TPCH_QUERY_OUTPUT_DIR is set.
  • The execution times for each query run will be stored in a file with path "<current working directory>/tpch_execution_times.txt" or to whatever file path TPCH_EXECUTION_TIMES points to.

For example, to replace the default locations you can use:

export TPCH_INPUT_DATA_DIR="$HOME/tpch-data"
export TPCH_QUERY_OUTPUT_DIR="$HOME/tpch-results"
export TPCH_EXECUTION_TIMES="$HOME/tpch-times.txt"

Other Implementations

  1. Data generator (http://www.tpc.org/tpch/)
  2. TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)
  3. TPC-H for PIG (https://github.com/ssavvides/tpch-pig)