
Data Warehouse Project - TPC-DS benchmarking on Spark SQL 👨🏻‍💻

Primary language: Shell · License: MIT

INFO-H-419: Data Warehouse Project ✨👨🏻‍💻


TPC-DS is a standard for benchmarking relational databases with a power test. The official TPC-DS tools can be found at tpc.org. We used v2.10.0 with some modifications from gregrahn/tpcds-kit.

For TPC-DS benchmarking, we used Spark SQL, Apache Spark's module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. You can find out more in the official documentation.

Before starting, you need to set up your machine. Please follow the guide below to set up Spark on a Mac.

Note: For Linux, you might need to manually install Apache Spark and set up your Python environment.

We have created a setup script which will set up brew, apache-spark and a conda environment. If you are on a Mac, you can run the following commands:

git clone https://github.com/mohammadzainabbas/data-warehouse-project.git
cd data-warehouse-project && sh scripts/setup.sh

After running the above commands, your file structure should look something like this (you need to go to the parent directory first: cd ..).

├── data-warehouse-project
│   ├── schema
│   ├── scripts
└── tpcds-kit
    ├── query_templates
    ├── query_variants
    ├── specification
    └── tools

For Mac users, that's all you need to do. You can now skip to the benchmarking part.

For Linux users, please follow the steps below.

If you are on Linux, you need to install Apache Spark yourself. You can follow this helpful guide to install Apache Spark.
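As a rough sketch of a manual install (the Spark version and archive URL below are assumptions for illustration; pick the current release from the official downloads page):

```shell
# Download and unpack a Spark release (version 3.3.0 chosen only as an example).
SPARK_VERSION=3.3.0
wget "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz"
tar -xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz"

# Point SPARK_HOME at the unpacked directory and expose the binaries.
export SPARK_HOME="$PWD/spark-${SPARK_VERSION}-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"
```

Add the two export lines to your shell profile if you want them to persist across sessions.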

We recommend installing conda on your machine. You can set up conda from here.

After you have conda, create a new environment via:

conda create -n spark_env python=3.8 pandas

Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).

Activate your environment:

conda activate spark_env

Now, you need to install pyspark:

pip install pyspark

If you are using bash:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc

And if you are using zsh:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc

For Linux users, you need to manually install tpcds-kit from gregrahn/tpcds-kit.
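The kit's tools are built from source; something along these lines should work (the make target follows the gregrahn/tpcds-kit README):

```shell
# Clone next to data-warehouse-project so the expected file structure matches.
git clone https://github.com/gregrahn/tpcds-kit.git
cd tpcds-kit/tools

# Build the kit's tools, including dsdgen (data generator) and dsqgen (query generator).
make OS=LINUX
```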

Make sure you have the same file structure as mentioned above.

Run the following command to benchmark for scale N:

sh scripts/benchmark.sh -scale N

Here, N is the scale factor against which you want to benchmark.
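A POSIX-sh flag loop like the following could consume the -scale argument (a hypothetical sketch, not the actual contents of scripts/benchmark.sh):

```shell
# Hypothetical sketch of -scale parsing; the real scripts/benchmark.sh may differ.
scale=1  # default scale factor in GB
while [ $# -gt 0 ]; do
  case "$1" in
    -scale) scale="$2"; shift 2 ;;
    *) shift ;;
  esac
done
echo "running TPC-DS benchmark at scale ${scale} GB"
```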

If you want to run multiple benchmark scales:

  1. Modify the benchmark_scales variable to your scale values.
  2. Run the below command:

sh scripts/benchmark_all.sh

This will run for all the scales.

The benchmarking results can be found under the benchmark folder. For example, when you run sh scripts/benchmark.sh -scale 1 (i.e. benchmarking for the 1 GB scale), you will find two files in the benchmark directory:

  1. benchmark_timings_1gb.csv contains the average time (in seconds) for all queries.
  2. benchmark_timings_1gb.pickle contains all the timings for each query (usually 5 per query).
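The averages in the CSV are just the per-query mean over the repeated runs. The snippet below illustrates that reduction with awk on a toy file (the column layout here is an assumption; the real benchmark_timings_1gb.csv may differ):

```shell
# Toy timings file: one row per (query, run) with elapsed seconds.
cat > /tmp/timings_demo.csv <<'EOF'
query,seconds
q1,2.0
q1,4.0
q2,1.0
EOF

# Average the runs for each query, skipping the header row.
avg=$(awk -F, 'NR>1 {sum[$1]+=$2; n[$1]++}
               END {for (q in sum) printf "%s=%.1f\n", q, sum[q]/n[q]}' /tmp/timings_demo.csv | sort)
echo "$avg"
# → q1=3.0
#   q2=1.0
```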