
Data Warehouse Project - TPC-DS benchmarking on Spark SQL 👨🏻‍💻

Primary language: Shell · License: MIT

INFO-H-419: Data Warehouse Project ✨👨🏻‍💻


TPC-DS is a standard for benchmarking relational databases with a power test. The official TPC-DS tools can be found at tpc.org. We used v2.10.0 with some modifications from gregrahn/tpcds-kit.

For TPC-DS benchmarking, we used Spark SQL, Apache Spark's module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. You can find out more in the official documentation.

Before starting, you need to set up your machine. Please follow the guide below to set up Spark on a Mac.

Note: For Linux, you might need to manually install Apache Spark and set up your Python environment.

We have created a setup script which will set up brew, apache-spark and a conda environment. If you are on a Mac, you can run the following commands:

git clone https://github.com/mohammadzainabbas/data-warehouse-project.git
cd data-warehouse-project && sh scripts/setup.sh

After running the above commands, your file structure should look something like this (you need to go to the parent directory first: cd ..).

├── data-warehouse-project
│   ├── schema
│   ├── scripts
└── tpcds-kit
    ├── query_templates
    ├── query_variants
    ├── specification
    └── tools

For Mac users, that's all you need to do. You can now skip to the benchmarking part.

For Linux users, please follow the steps below.

If you are on Linux, you need to install Apache Spark yourself. You can follow this helpful guide to install Apache Spark.
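As a rough sketch of a manual install (the Spark version and archive URL below are assumptions for illustration; pick the current release from the official downloads page):

```shell
# Download and unpack a Spark release (version 3.3.0 chosen only as an example).
SPARK_VERSION=3.3.0
wget "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz"
tar -xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz"

# Point SPARK_HOME at the unpacked directory and expose the binaries.
export SPARK_HOME="$PWD/spark-${SPARK_VERSION}-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"
```

Add the two export lines to your shell profile if you want them to persist across sessions.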

We recommend installing conda on your machine. You can set up conda from here.

After you have conda, create a new environment via:

conda create -n spark_env python=3.8 pandas

Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).

Activate your environment:

conda activate spark_env

Now, you need to install pyspark:

pip install pyspark

If you are using bash:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc

And if you are using zsh:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc

For Linux users, you need to manually install tpcds-kit from gregrahn/tpcds-kit.
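The kit's tools are built from source; something along these lines should work (the make target follows the gregrahn/tpcds-kit README):

```shell
# Clone next to data-warehouse-project so the expected file structure matches.
git clone https://github.com/gregrahn/tpcds-kit.git
cd tpcds-kit/tools

# Build the kit's tools, including dsdgen (data generator) and dsqgen (query generator).
make OS=LINUX
```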

Make sure you have the same file structure as mentioned above.

Run the following command to benchmark for scale N:

sh scripts/benchmark.sh -scale N

Here, N is the scale factor against which you want to benchmark.
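A POSIX-sh flag loop like the following could consume the -scale argument (a hypothetical sketch, not the actual contents of scripts/benchmark.sh):

```shell
# Hypothetical sketch of -scale parsing; the real scripts/benchmark.sh may differ.
scale=1  # default scale factor in GB
while [ $# -gt 0 ]; do
  case "$1" in
    -scale) scale="$2"; shift 2 ;;
    *) shift ;;
  esac
done
echo "running TPC-DS benchmark at scale ${scale} GB"
```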

If you want to run multiple benchmark scales:

  1. Modify the benchmark_scales variable to your scale values.
  2. Run the below command:

sh scripts/benchmark_all.sh

This will run for all the scales.

The benchmarking results can be found under the benchmark folder. For example, when you run sh scripts/benchmark.sh -scale 1 (i.e. benchmarking for the 1 GB scale), you will find two files in the benchmark directory:

  1. benchmark_timings_1gb.csv contains the average time (in seconds) for all queries.
  2. benchmark_timings_1gb.pickle contains all the timings for each query (usually 5 per query).
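The averages in the CSV are just the per-query mean over the repeated runs. The snippet below illustrates that reduction with awk on a toy file (the column layout here is an assumption; the real benchmark_timings_1gb.csv may differ):

```shell
# Toy timings file: one row per (query, run) with elapsed seconds.
cat > /tmp/timings_demo.csv <<'EOF'
query,seconds
q1,2.0
q1,4.0
q2,1.0
EOF

# Average the runs for each query, skipping the header row.
avg=$(awk -F, 'NR>1 {sum[$1]+=$2; n[$1]++}
               END {for (q in sum) printf "%s=%.1f\n", q, sum[q]/n[q]}' /tmp/timings_demo.csv | sort)
echo "$avg"
# → q1=3.0
#   q2=1.0
```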