A testbench for experimenting with Apache Hive at any data scale.
Cloned from https://github.com/hortonworks/hive-testbench and modified so that the query engine and the storage format (orc or parquet) can be specified. TPC-H was removed from the framework because it was out of scope.
Additions to the following overview for the Ceph Object Store Benchmark: wrapper scripts were added to run data generation and to run queries.
- tpcds-generate.sh: runs data generation at scale factors of 1 TB, 10 TB and 100 TB for both orc and parquet formats. Change the directory and comment out the tests you do not need before running.
- tpcds-run.sh: runs queries for Presto, Spark, Hive and Hive on Spark at various scale factors. Pass in the directory that contains the queries to run; the directory should also contain a control file that gives the order in which to run the queries. For example, to run UC11: ./tpcds-run.sh UC11
- tpcds-concurrent-run.sh: the same as tpcds-run.sh, except that it also runs the queries concurrently. For example, to run UC11 with 4 concurrent threads: ./tpcds-concurrent-run.sh UC11 4
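The wrapper scripts themselves are not reproduced here. As a rough sketch of the two ideas they rely on (an ordered control file and concurrent workers), the following POSIX shell helpers may be useful. The function names, the default control file name "control", and the one-file-name-per-line format are all assumptions made for illustration; the real scripts may work differently.

```shell
# Hypothetical helpers; the real tpcds-run.sh and tpcds-concurrent-run.sh may differ.

# Print the query files named in a control file, in order.
# Assumed format: one file name per line; blank lines and '#' lines are skipped.
# "control" as the default file name is an assumption.
list_queries() {
  lq_dir=$1
  lq_control=${2:-control}
  while IFS= read -r lq_line || [ -n "$lq_line" ]; do
    case $lq_line in
      ''|'#'*) continue ;;
    esac
    printf '%s/%s\n' "$lq_dir" "$lq_line"
  done < "$lq_dir/$lq_control"
}

# Start N background workers running the same command, then wait for all of them.
# Each worker receives its index as the last argument.
run_concurrent() {
  rc_n=$1
  shift
  rc_i=1
  while [ "$rc_i" -le "$rc_n" ]; do
    "$@" "$rc_i" &
    rc_i=$((rc_i + 1))
  done
  wait
}
```

With these definitions, list_queries UC11 would print the UC11 queries in control-file order, and run_concurrent 4 some_command would start four workers, which mirrors the "UC11 4" example above.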
The hive-testbench is a data generator and set of queries that lets you experiment with Apache Hive at scale. The testbench allows you to experience base Hive performance on large datasets, and gives an easy way to see the impact of Hive tuning parameters and advanced settings.
You will need:
- Hadoop 2.2 or later cluster or Sandbox.
- Apache Hive.
- Between 15 minutes and 2 days to generate data (depending on the Scale Factor you choose and available hardware).
- If you plan to generate 1TB or more of data, using Apache Hive 13+ to generate the data is STRONGLY suggested.
All of these steps should be carried out on your Hadoop cluster.
Step 1: Prepare your environment.
In addition to Hadoop and Hive, before you begin ensure gcc is installed and available on your system path. If your system does not have it, install it using yum or apt-get.
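One way to confirm this prerequisite before building is to probe the path from the shell. This is a minimal sketch; the helper name check_tool is hypothetical and not part of the testbench.

```shell
# check_tool is a hypothetical helper, not part of hive-testbench.
# It reports whether a command can be found on the current PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# gcc is the prerequisite named in this step.
check_tool gcc || echo "gcc not found; install it with yum or apt-get"
```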
Step 2: Decide which test suite(s) you want to use.
hive-testbench comes with data generators and sample queries based on the TPC-DS benchmark. More information about this benchmark can be found at the Transaction Processing Performance Council homepage.
Step 3: Compile and package the appropriate data generator.
For TPC-DS, ./tpcds-build.sh downloads, compiles and packages the TPC-DS data generator.
Step 4: Decide how much data you want to generate.
You need to decide on a "Scale Factor" which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, scale 1000 (1 TB) of data is a good starting point. If you have a large cluster, you may want to choose Scale 10000 (10 TB) or more. The notion of scale factor is similar between TPC-DS and TPC-H.
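The size-to-scale arithmetic can be sketched in a few lines of shell. The helper name tb_to_scale is hypothetical, and the result is only as exact as the "roughly one gigabyte per Scale Factor unit" rule of thumb stated above.

```shell
# tb_to_scale is a hypothetical helper: terabytes in, Scale Factor out.
# It relies on the rule of thumb above that one Scale Factor unit is about 1 GB.
tb_to_scale() {
  echo $(( $1 * 1000 ))
}

# Print the Scale Factor for a few common target sizes.
for tb in 1 10 100; do
  echo "${tb} TB -> Scale Factor $(tb_to_scale "$tb")"
done
```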
If you want to generate a large amount of data, you should use Hive 13 or later. Hive 13 introduced an optimization that allows far more scalable data partitioning. Hive 12 and lower will likely crash if you generate more than a few hundred GB of data and tuning around the problem is difficult. You can generate text or RCFile data in Hive 13 and use it in multiple versions of Hive.
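A version guard along these lines could be placed in front of a large generation run. This is a sketch under two assumptions: that "Hive 13" refers to release 0.13, and that you already have the version string at hand (for example from hive --version, whose exact output format is not parsed here). The helper name is hypothetical.

```shell
# hive_supports_large_scale is a hypothetical helper.
# Assumption: "Hive 13" in the text means release 0.13, so any version
# 0.13 or later (including 1.x, 2.x, 3.x) passes.
hive_supports_large_scale() {
  major=${1%%.*}          # text before the first dot
  rest=${1#*.}            # text after the first dot
  minor=${rest%%.*}       # text before the next dot
  if [ "$major" -gt 0 ]; then
    return 0
  fi
  [ "$minor" -ge 13 ]
}

# Classify a few sample version strings.
for v in 0.12.0 0.13.1 1.2.1; do
  if hive_supports_large_scale "$v"; then
    echo "$v: suitable for 1 TB or more"
  else
    echo "$v: stay below a few hundred GB"
  fi
done
```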
Step 5: Generate and load the data.
The script tpcds-setup.sh generates and loads data for TPC-DS. General usage is:
tpcds-setup.sh scale_factor [directory]
Some examples:
Build 1 TB of TPC-DS data:
./tpcds-setup.sh 1000
Build 1 TB of TPC-H data (upstream hive-testbench only; TPC-H was removed from this framework):
./tpch-setup.sh 1000
Build 100 TB of TPC-DS data:
./tpcds-setup.sh 100000
Build 30 TB of text formatted TPC-DS data:
FORMAT=textfile ./tpcds-setup.sh 30000
Build 30 TB of RCFile formatted TPC-DS data:
FORMAT=rcfile ./tpcds-setup.sh 30000
Also check the other parameters in the setup scripts; an important one is BUCKET_DATA.
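When several scale factors and formats are needed, it can help to print the planned commands before running anything. The sketch below does only that. The helper name plan_generation is hypothetical, and it assumes that FORMAT accepts orc and parquet in this modified testbench in the same way the examples above pass textfile and rcfile.

```shell
# plan_generation is a hypothetical dry-run helper; it only prints commands.
# Assumption: FORMAT accepts "orc" and "parquet" in this modified testbench,
# in the same way the examples above pass textfile and rcfile.
plan_generation() {
  for scale in "$@"; do
    for format in orc parquet; do
      echo "FORMAT=$format ./tpcds-setup.sh $scale"
    done
  done
}

# 1 TB and 10 TB, both formats.
plan_generation 1000 10000
```

Reviewing the printed list first is a cheap safeguard, since a single large run can take up to two days.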
Step 6: Run queries.
More than 50 sample TPC-DS queries are included for you to try. You can use hive, beeline or the SQL tool of your choice. The testbench also includes a set of suggested settings.
This example assumes you have generated 1 TB of TPC-DS data during Step 5:
cd sample-queries-tpcds
hive -i testbench.settings
hive> use tpcds_bin_partitioned_orc_1000;
hive> source query55.sql;
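The interactive session above can also be expressed as a single non-interactive command, which is easier to script. The sketch below only prints that command. The helper name query_command is hypothetical; -i, --database and -f are standard Hive CLI options.

```shell
# query_command is a hypothetical helper; it prints the command instead of running it.
# -i (init file), --database and -f (script file) are standard Hive CLI options.
query_command() {
  printf 'hive -i testbench.settings --database %s -f %s\n' "$1" "$2"
}

# Same database and query as the interactive example.
query_command tpcds_bin_partitioned_orc_1000 query55.sql
```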
Note that the database is named based on the Data Scale chosen in Step 4. At Data Scale 10000, your database will be named tpcds_bin_partitioned_orc_10000. At Data Scale 1000 it would be named tpcds_bin_partitioned_orc_1000. You can always run show databases to get a list of available databases.
Similarly, if you generated 1 TB of TPC-H data during Step 5 (TPC-H is part of the upstream hive-testbench only):
cd sample-queries-tpch
hive -i testbench.settings
hive> use tpch_flat_orc_1000;
hive> source tpch_query1.sql;
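The TPC-DS database naming pattern used in the examples can be captured in a small helper so that scripts do not hard-code the scale. The name tpcds_db_name is hypothetical, and the sketch assumes that only the trailing scale changes; databases created with other formats may be named differently, so check show databases if in doubt.

```shell
# tpcds_db_name is a hypothetical helper that follows the naming pattern
# in the examples above: tpcds_bin_partitioned_orc_<scale>.
# Assumption: only the trailing scale changes; other formats may be named differently.
tpcds_db_name() {
  echo "tpcds_bin_partitioned_orc_$1"
}

# Print the database names for 1 TB and 10 TB.
for scale in 1000 10000; do
  tpcds_db_name "$scale"
done
```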
If you have questions, comments or problems, visit the Hortonworks Hive forum.
If you have improvements, pull requests are accepted.