Mallet - A Decision Support Benchmark Derived from TPC-DS
- Release version: 1.0
- Release date: 2014/7/14
- Contact: Rui Sun, Lan Yi, Hao Cheng, Jiangang, Duan
- Homepage: https://github.com/intel-hadoop/Mallet
Contents:
- Overview
- Getting Started
- Project Directory Layout
- Limitations and Known Issues
Overview
Mallet is an open source decision support benchmark for HiveQL-compatible SQL engines, derived from the TPC-DS benchmark. It follows TPC-DS's modeling of several generally applicable aspects of a decision support system, including the database schema, data population, queries, and data maintenance. Note that Mallet results are not comparable to any published TPC-DS benchmark results.
Mallet implements the Driver and follows the benchmark procedure defined in the TPC-DS 1.1.0 specification. It performs the Load Test, the Power Test, and Throughput Tests 1 and 2, and it reports a subset of the metrics defined in the specification. A text-based report with the primary performance metric (QphM@SF) is generated when the benchmark completes.
Mallet uses the TPC-DS toolkit for data preparation, i.e., generation of the table and maintenance data. To speed up data generation, Mallet combines Hadoop Streaming with the DSDGEN tool's support for generating data in chunks, so that the chunks are produced in parallel across the cluster. The generated data is stored in HDFS. For HIVE and SHARK, the data is loaded in place as external tables.
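As a rough illustration of chunked generation, the sketch below prints one DSDGEN command per chunk rather than running anything. It assumes the -SCALE, -PARALLEL, -CHILD and -DIR options as described in the TPC-DS toolkit documentation; the function name chunk_commands and the output directory layout are hypothetical, and this is not necessarily how Mallet's own scripts invoke the tool.

```shell
#!/bin/sh
# Sketch only: print the dsdgen command for each data chunk.
# The output directory layout below is a hypothetical placeholder.

# chunk_commands SCALE PARALLEL [OUTDIR]
chunk_commands() (
  scale=$1
  parallel=$2
  outdir=${3:-/tmp/mallet-data}
  child=1
  while [ "$child" -le "$parallel" ]; do
    # Each child generates the child-th of PARALLEL chunks.
    echo "./dsdgen -SCALE $scale -PARALLEL $parallel -CHILD $child -DIR $outdir/chunk$child"
    child=$((child + 1))
  done
)

# Example: scale factor 100 split into 4 chunks.
chunk_commands 100 4
```

Because each chunk is independent of the others, every printed command could be handed to a separate map task, which is the idea behind generating the data through Hadoop Streaming.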
Mallet's workload consists of 65 queries in HiveQL which are converted from the corresponding TPC-DS queries in SQL. Mallet's workload also implements in HiveQL all of the data maintenance functions defined in the TPC-DS 1.1.0 specification.
Mallet uses JDBC to submit the workload to the target database, so it works with HiveQL-compatible databases that support JDBC, such as HIVE and SHARK.
Getting Started
Prerequisites
- Java 1.6
- The TPC-DS Toolkit
Download the TPC-DS software package from the TPC website. Follow the guide in the package to build the DSQGEN and DSDGEN tools from the source code it contains. Make sure the resulting binaries can execute on every node of your Hadoop cluster; differences in the operating environment, for example the GLIBC version, can prevent them from running.
- Hadoop 1.x or 2.x
Make sure you set the HADOOP_HOME environment variable or put the <HADOOP Home>/bin directory in the PATH environment variable.
For Hadoop 1.x, only Hadoop 1.0.4 was tested.
For Hadoop 2.x, only CDH 5.0 beta in YARN mode was tested.
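Before going further, it can help to confirm that the hadoop executable is reachable through one of the two routes above. The following is a minimal sketch of such a check; the function name find_hadoop is hypothetical, and the lookup order is an assumption of this example rather than a description of how Mallet's scripts locate Hadoop.

```shell
#!/bin/sh
# Sketch only: locate the hadoop executable via HADOOP_HOME or PATH.

# find_hadoop: print the path of the hadoop executable, or return 1.
find_hadoop() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -x "$HADOOP_HOME/bin/hadoop" ]; then
    # HADOOP_HOME is set and contains an executable bin/hadoop.
    echo "$HADOOP_HOME/bin/hadoop"
  elif command -v hadoop >/dev/null 2>&1; then
    # Otherwise fall back to whatever is on PATH.
    command -v hadoop
  else
    return 1
  fi
}

if hadoop_bin=$(find_hadoop); then
  echo "Using Hadoop executable: $hadoop_bin"
else
  echo "hadoop not found: set HADOOP_HOME or add <HADOOP Home>/bin to PATH" >&2
fi
```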
Requirements on Target Databases
HIVE
If you want to run Mallet with HIVE as the target database, HIVE 0.13.1 or 0.12.0 is required. HIVE 0.13.1 is the default target.
- HIVE 0.13.1
Download the source package of HIVE 0.13.1 from http://hive.apache.org/downloads.html, apply the patch file (src/main/resources/hive/hive-0.13.1.patch), then build the HIVE binary.
- HIVE 0.12.0
Download the source package of HIVE 0.12.0 from http://hive.apache.org/downloads.html, apply the patch file (src/main/resources/hive/hive-0.12.0.patch), then build the HIVE binary.
Then modify pom.xml to change the version of the hive-jdbc dependency to 0.12.0, and build Mallet.
SHARK
If you want to run Mallet with SHARK as the target database, SHARK 0.9.0 is required.
Install Mallet
- Check out the Mallet source code from the open source repository to a local directory.
- In your local Mallet directory, type mvn clean package to build Mallet.
Note: If you intend to run Mallet with SHARK, change the version of the hive-jdbc dependency in pom.xml from 0.12.0 to 0.11.0.
- After the build, the Mallet binary can be found at target/mallet-1.0-bin/mallet-1.0/.
Configure Mallet
- Copy the TPC-DS Tool Binaries
The DSQGEN tool appears to have problems with long paths. As a workaround, copy the three TPC-DS tool files you built earlier (dsdgen, dsqgen, tpcds.idx) to the tools sub-directory of <Mallet binary directory>.
- Configure Benchmark Parameters
You can modify the benchmark parameters in <Mallet binary directory>/conf/conf.properties:
hiveServerHost   <The host name of the target database's JDBC service>
hiveServerPort   <The port number of the target database's JDBC service>
numberOfStreams  <The number of query streams in the Throughput Tests>
scaleFactor      <The scale factor. Valid options are 1, 100, 300, 1000, 3000, 10000, 30000, 100000>
user             <The username used to connect to JDBC>
malletDbDir      <The HDFS root directory for the Mallet data>
- Configure Data Preparation
You need to set some global environment variables in <Mallet binary directory>/bin/config.sh:
PARALLEL               <The number of data chunks generated in parallel>
HADOOP_EXECUTABLE      <The Hadoop executable location. Optional, set if it can't be automatically discovered>
HADOOP_CONF_DIR        <The Hadoop configuration directory. Optional, set if it can't be automatically discovered>
STREAMING              <The path to the Hadoop Streaming jar. Optional, set if it can't be automatically discovered>
COMPRESS_GLOBAL        <Switches compression of the generated data on or off: 0 disables it, 1 enables it. Optional>
COMPRESS_CODEC_GLOBAL  <The default codec used for data compression. Optional>
Note:
- Mallet will guess the value of these variables if they are not explicitly set. In that case, Mallet guarantees neither that the guess is correct nor that the benchmark will run successfully.
- Do not change the default values of other global environment variables unless necessary.
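Two of the settings above lend themselves to a quick sanity check before a long run: scaleFactor must be one of the listed values, and PARALLEL has a recommended value, given in the Run Mallet steps, of (Map task capacity of the cluster*2)/(21+numberOfStreams). The sketch below covers both. The function names are hypothetical, and rounding down with a minimum of 1 is an assumption of this example, since the formula does not say how to round.

```shell
#!/bin/sh
# Sketch only: helper checks for conf.properties and config.sh values.

# is_valid_scale_factor SF: succeed only for the documented scale factors.
is_valid_scale_factor() {
  case "$1" in
    1|100|300|1000|3000|10000|30000|100000) return 0 ;;
    *) return 1 ;;
  esac
}

# recommended_parallel MAP_CAPACITY STREAMS
# Evaluates (MAP_CAPACITY * 2) / (21 + STREAMS) with integer division.
# Assumption: round down, but never go below 1.
recommended_parallel() (
  value=$(( ($1 * 2) / (21 + $2) ))
  if [ "$value" -lt 1 ]; then
    value=1
  fi
  echo "$value"
)

# Hypothetical cluster: 100 map task slots, 4 query streams.
if is_valid_scale_factor 100; then
  echo "scaleFactor OK"
fi
echo "PARALLEL=$(recommended_parallel 100 4)"
```

With these hypothetical numbers the formula gives (100*2)/(21+4) = 8, which could then be set as PARALLEL in bin/config.sh.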
Run Mallet
- cd <Mallet binary directory>
- Make sure Hadoop is running. Type bin/prepare.sh to generate the table and maintenance data. You can try increasing the PARALLEL variable to reduce the duration of data generation; the recommended value is (Map task capacity of the cluster*2)/(21+numberOfStreams).
- Make sure the target database is running. Type bin/run.sh to start the benchmark.
bin/run.sh without any command line options performs a complete benchmark. You can provide one of the following command line options to alter this behavior:
--quickrun          Performs the benchmark with empty query and data maintenance operations. Used to facilitate development and to verify installation and environment settings.
--powertest         Performs only the Power Test.
--query <query id>  Performs only a single query.
- Upon completion of the benchmark, a report with the primary performance metric (QphM@SF) is written to stdout.
The following is an abbreviated sample report:
------------------- Mallet Benchmark Report --------------------
Number of Query Streams: 4
Number of queries in Query Stream: 65
Database Load Elapsed Time: 0h:0m:7s
Power Test Elapsed Time: 4h:24m:51s
Throughput Run 1 Elapsed Time: 6h:24m:24s
Query Run 1 Elapsed Time: 5h:14m:10s
Refresh Run 1 Elapsed Time: 1h:10m:13s
Throughput Run 2 Elapsed Time: 6h:17m:7s
Query Run 2 Elapsed Time: 5h:6m:58s
Refresh Run 2 Elapsed Time: 1h:10m:8s
----------------------------------------------------------------
Performance Metric = 25.700481928957096 QphM@1GB
----------------------------------------------------------------
---------- Query Run 1 Timing Intervals (in seconds) -----------
Query Minimum Median Maximum
2 225.712 235.046 247.064
...
98 157.288 203.679 217.967
---------- Query Run 2 Timing Intervals (in seconds) -----------
Query Minimum Median Maximum
2 201.364 215.648 227.006
...
98 158.62 210.739 220.181
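Since the report is written to stdout, it can be captured to a file and post-processed. The sketch below assumes the line formats shown in the sample report above; the function names extract_metric and to_seconds, and the file name report.txt mentioned in the comments, are hypothetical.

```shell
#!/bin/sh
# Sketch only: pull values out of a captured Mallet report.

# extract_metric: read a report on stdin and print the metric value.
extract_metric() {
  awk '/Performance Metric =/ { print $4; exit }'
}

# to_seconds ELAPSED: convert a value such as 4h:24m:51s to seconds.
to_seconds() {
  echo "$1" | awk -F'[hms:]+' '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Example using lines from the sample report. In practice the input could be
# a file saved with something like: bin/run.sh | tee report.txt
printf 'Performance Metric = 25.700481928957096 QphM@1GB\n' | extract_metric
to_seconds "4h:24m:51s"
```

Converting the elapsed times to seconds makes it easier to compare runs, for example the Power Test time across two configurations.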
Project Directory Layout
src/main
|- java -- The Mallet benchmark driver source code, implemented in Java.
|- config -- The configurations for the benchmark, such as number of streams, Scale Factor, database JDBC server and port, …
|- scripts -- The shell scripts for data preparation and benchmark runs.
|- resources
   |- dm_function -- DM functions in HiveQL for Data Maintenance.
   |- hive -- HiveQL scripts to create TPC-DS tables and refresh tables.
   |- query_templates -- Query templates in HiveQL.
Limitations and Known Issues
- Mallet contains only 65 queries, a subset of the 99 queries defined in the TPC-DS 1.1.0 specification.
- Mallet relaxes the ACID requirements on the target databases in the TPC-DS 1.1.0 specification.
- Mallet does not report price related metrics defined in the TPC-DS 1.1.0 specification.