flink-sql-benchmark

TPC-DS benchmark

Generate the test Hive dataset

  • Step 1: Prepare your environment

    Make sure Hadoop and Hive are installed on your cluster. gcc is also needed to build the TPC-DS data generator.
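
    As a quick sanity check, commands like the following should succeed on a correctly prepared node (a minimal sketch; exact versions and install paths depend on your cluster):

    hadoop version
    hive --version
    gcc --version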

  • Step 2: Build the data generator

    cd hive-tpcds-setup

    Run ./tpcds-build.sh

    This script downloads and builds the TPC-DS data generator.

  • Step 3: Generate TPC-DS dataset

    cd hive-tpcds-setup

    Run ./tpcds-setup.sh 10000 to generate the 10TB dataset; the resulting Hive database is tpcds_bin_orc_10000.

    Run ./tpcds-setup.sh <SCALE_FACTOR> to generate the dataset. The scale factor controls how much data is generated and roughly translates to gigabytes: for example, ./tpcds-setup.sh 10 generates about 10GB of data. Note that the scale factor must be greater than 1.

    tpcds-setup.sh will launch a MapReduce job to generate the data in text format. By default, the generated data is placed under /tmp/tpcds-generate/<SCALE_FACTOR> on your HDFS cluster. If that folder already exists, the MapReduce job is skipped.
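
    For example, you can inspect the generated files with standard HDFS commands, or remove the folder to force regeneration (a sketch assuming scale factor 10000; adjust the path to your scale factor):

    hdfs dfs -ls /tmp/tpcds-generate/10000
    hdfs dfs -rm -r /tmp/tpcds-generate/10000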

    Once data generation completes, tpcds-setup.sh will load the data into Hive tables. Make sure the hive executable is in your PATH; alternatively, you can specify the path to your Hive executable via the HIVE_BIN environment variable.

    tpcds-setup.sh will create external Hive tables over the generated text files. These tables reside in a database named tpcds_text_<SCALE_FACTOR>. tpcds-setup.sh will then convert the text tables into an optimized format and place the converted tables in the database tpcds_bin_<FORMAT>_<SCALE_FACTOR>. By default, the optimized format is orc. You can choose a different format by setting the FORMAT environment variable. The following example creates a 1TB test dataset in parquet format:

    FORMAT=parquet HIVE_BIN=/path/to/hive ./tpcds-setup.sh 1000

    Once the data is loaded into Hive, you can use the database tpcds_bin_<FORMAT>_<SCALE_FACTOR> to run the benchmark.
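
    To verify the load, you can list the tables and count rows in one of them with the Hive CLI (a sketch assuming the 1TB parquet dataset from the example above; store_sales is one of the standard TPC-DS tables):

    hive -e "SHOW TABLES IN tpcds_bin_parquet_1000;"
    hive -e "SELECT COUNT(*) FROM tpcds_bin_parquet_1000.store_sales;"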

Run the benchmark in Flink

  • Step 1: Prepare your Flink environment.

  • Step 2: Build the test jar.

    • Modify the Flink and Hive versions in pom.xml.

    • Run mvn clean install.
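
    If the pom exposes the Flink and Hive versions as Maven properties (an assumption; check pom.xml for the actual property names), you may be able to override them from the command line instead of editing the file:

    mvn clean install -Dflink.version=1.16.0 -Dhive.version=3.1.2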

  • Step 3: Run

    • flink_home/bin/flink run -c com.ververica.flink.benchmark.Benchmark ./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar --database tpcds_bin_orc_10000 --hive_conf hive_home/conf
    • optional --location: path to the SQL queries; defaults to the queries bundled in the jar.
    • optional --queries: SQL query names, e.g. 'q1.sql'. If the value is 'all', all queries will be executed.
    • optional --iterations: the number of iterations to run per query, default is 1.
    • optional --parallelism: the job parallelism, default is 800.
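
    A fuller invocation combining these flags might look like the following (a sketch; flink_home and hive_home are placeholders as in the command above, and the flag values should be adapted to your cluster):

    flink_home/bin/flink run -c com.ververica.flink.benchmark.Benchmark ./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar --database tpcds_bin_orc_10000 --hive_conf hive_home/conf --queries q1.sql --iterations 3 --parallelism 200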

Run the benchmark in other systems

Because the prepared test data is standard Hive data, other computation frameworks that integrate with Hive can also run the benchmark with little extra work. Set up your own environment and test it.
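
For example, with Spark's Hive integration configured, the same database can be queried directly from the spark-sql shell (a sketch; the query file and Spark's Hive configuration are assumptions about your setup):

spark-sql --database tpcds_bin_orc_10000 -f q1.sql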

If you have any questions, please contact: