This repo contains pre-generated query stream files for a few different scale factors. The convention for TPC-DS is "sfXXXX" where XXXX is the size of the raw data set in gigabytes (GB).
Sacle Factor | Raw Dataset Size (approximate) | Directory Name |
---|---|---|
sf100 | 100 GB | sf100_queries |
sf1000 | 1 TB (1,000 GB) | sf1000_queries |
sf3000 | 3 TB (3,000 GB) | sf3000_queries |
sf10000 | 10 TB (10,000 GB) | sf10000_queries |
For each scale, there are four query stream files, named query_N.sql.
Details about the official TPD-DS benchmark can be found at https://www.tpc.org/tpcds/default5.asp. The queries are in a pre-determined (pseudo random) order per the TPC-DS spec.
In general, any results obtained using the query stream files contained in this repos should follow the guidelines from TPC.org for the TPC DS benchmark which are spelled out explicitly in the license agreement for the benchmark.
The query stream files in this repo were generated using the nds_gen_query_stream.py utility which can be found in the NVIDIA/spark_rapids_benchmarks github repo.
nds_gen_query_stream.py is essentially a wrapper for the offiical TPC-DS query generator (dsqgen).
The syntax for the command line is documented here.
usage: nds_gen_query_stream.py [-h] (--template TEMPLATE | --streams STREAMS)
template_dir scale output_dir
positional arguments:
template_dir directory to find query templates and dialect file.
scale assume a database of this scale factor.
output_dir generate query in directory.
optional arguments:
-h, --help show this help message and exit
--template TEMPLATE build queries from this template. Only used to generate one query from one tempalte. This argument is mutually exclusive with --streams. It
is often used for test purpose.
--streams STREAMS generate how many query streams. This argument is mutually exclusive with --template.
--rngseed RNGSEED seed the random generation seed.