/tpc-ds_big-query

Scripts to execute TPC - DS on Big Query

Primary LanguageTSQL

TPC - DS - Big Query

Credits: Most scripts have been referenced from Fivetran DW Benchmark and have been adapted to suit our particular usecase.

Steps:

  1. Move dsdgen to a GCS bucket to a specific location as mentioned in the bootstrap script
  2. Create a High CPU VM eg. 16vCPU
  3. Clone this repository
git clone $REPO_URL
  1. Give all script files executable permission
chmod +x *.sh
  1. Run bootstrap.sh
    1. This pulls dsdgen binary
    2. Installs Google Fuse; this is to mount GCS bucket as a local folder - More info
  2. Run data_gen.sh
    Usage:
    ./data_gen.sh $CPU $SCALE
    1. This is responsible for generating data
    2. $CPU denotes the amount of parallelism must be > 1
    3. $SCALE denotes the scale of data that needs to be generated
    4. This creates and mounts a GCS Bucket and writes data to it
    5. NOTE: Ensure that $CPU is close to number of CPUs in VM for efficient parallel generation
  3. Run load_data.sh
    Usage:
    ./load_data.sh $SCALE
    1. This is responsible of loading data in GCP buckets created in step 5 to BigQuery
    2. $SCALE denotes the scale of data that needs to be loaded to BigQuery
    3. Note: Before running this step ensure that data is generated and present in the appropriate GCS Bucket
  4. Run benchmark.sh
    Usage:
    ./benchmark.sh $SCALE
    1. This is responsible for running TPC-DS queries and measuring query execution time
    2. Generates a csv file in results folder containing the query start_time and end_time
    3. Saves query statistics in the same directory