### Hadoop Benchmark Suite (HiBench) ###
HiBench - the Hadoop Benchmark Suite

Release version: 2.1
Release date: 2012/06/13
Contact: Lan Yi(lan.yi@intel.com), Yan Liu(yan.b.liu@intel.com), Jason Dai(jason.dai@intel.com) 
Homepage: https://github.com/hibench/HiBench-2.1

Contents:
1 Overview
2 Getting Started
3 Running

* 1 OVERVIEW
This benchmark suite contains 9 typical Hadoop workloads (including micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks). This benchmark suite also has options for users to enable input/output compression for most workloads with the default compression codec (zlib). For some initial work based on this benchmark suite, please refer to the included ICDE workshop paper (i.e., WISS10_conf_full_011.pdf).

Note: Currently only the bayes and nutchindexing benchmarks in this suite require the companion HiBench-Data package.

Micro Benchmarks:
1) Sort (sort)
This workload sorts its *text* input data, which is generated using the Hadoop RandomTextWriter example.

2) WordCount (wordcount)
This workload counts the occurrence of each word in the input data, which is generated using the Hadoop RandomTextWriter example. It is representative of another typical class of real-world MapReduce jobs - extracting a small amount of interesting data from a large data set.

3) TeraSort (terasort)
TeraSort is a standard benchmark created by Jim Gray. It sorts 10 billion 100-byte records (as generated by the Hadoop TeraGen example). 
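
For reference, the input for these micro benchmarks is produced by standard Hadoop example programs that the prepare scripts invoke for you; a hedged sketch of equivalent manual invocations (the jar name and HDFS paths below are illustrative, not the paths the scripts actually use):

  # Generate random text input for Sort/WordCount (RandomTextWriter example).
  hadoop jar $HADOOP_HOME/hadoop-examples-*.jar randomtextwriter /HiBench/RandomText
  # Generate 10 billion 100-byte rows (~1 TB) for TeraSort (TeraGen example).
  hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 10000000000 /HiBench/Terasort/Input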

HDFS Benchmarks:
4) enhanced DFSIO (dfsioe)
Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster.

Web Search Benchmarks:
5) Nutch indexing (nutchindexing)
Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system of Nutch, a popular open source (Apache project) search engine. The crawler sub-system of Nutch is used to crawl an in-house Wikipedia mirror and generate a total of 8.4 GB of compressed data (for about 2.4M web pages) as the input of the workload.
	
6) PageRank (pagerank)
This workload contains an implementation of the PageRank algorithm on Hadoop (a search-engine ranking benchmark included in Mahout 0.6). It uses automatically generated Web data whose hyperlinks follow a Zipfian distribution.

Machine Learning Benchmarks:
7) Mahout Bayesian classification (bayes)
Large-scale machine learning is another important use of MapReduce. This workload tests the Naive Bayesian trainer (a popular classification algorithm for knowledge discovery and data mining) in Mahout, an open source (Apache project) machine learning library. The workload uses a subset of the Wikipedia dump (as of 2008/03/12) as the input.
	
8) Mahout K-means clustering (kmeans)
This workload tests the K-means clustering (a well-known clustering algorithm for knowledge discovery and data mining) in Mahout 0.6. The input data set is generated by GenKMeansDataset based on uniform and Gaussian distributions.

Data Analytics Benchmarks:
9) Hive Query Benchmarks (hivebench)
This workload is developed based on the SIGMOD 09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. It contains Hive queries (Aggregation and Join) performing the typical OLAP queries described in the paper. Its input is also automatically generated Web data with hyperlinks following a Zipfian distribution.
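
As a rough illustration of the kind of query involved (the table and column names below follow the SIGMOD 09 paper and are assumptions here, not necessarily the names used by the hivebench scripts), the aggregation case boils down to something like:

  hive -e "SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;"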

* 2 Getting Started

2.0 Prerequisites 

2.0.1 Setup HiBench-2.1
To set up the HiBench-2.1 working environment, you should
  1. Download/check out the HiBench-2.1 benchmark suite from https://github.com/hibench/HiBench-2.1
  2. Download the HiBench-Data packages from https://www.dropbox.com/sh/f4k8dioyy7ee1l4/s3F1crAXP-
  3. Create/set up directories as below and unzip the downloaded packages into the correct folders accordingly (see the sketch after this layout)
     --HiBench-All (or any name you would like to use)
         |--HiBench-2.1
         |--HiBench-Data
             |-- wikibayes
             |-- wikinutch
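
For example, a minimal sketch of creating this layout (archive names, formats, and download paths below are illustrative):

  mkdir HiBench-All && cd HiBench-All
  # unpack the HiBench-2.1 suite here (or check it out with git)
  unzip /path/to/HiBench-2.1.zip
  # unpack the HiBench-Data packages so that wikibayes/ and wikinutch/
  # end up under HiBench-Data/
  unzip /path/to/HiBench-Data.zip -d HiBench-Data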

2.0.2 Setup Hadoop
Before you run any workload in the package, please verify that the Hadoop framework is running correctly. All the workloads have been tested with Cloudera Distribution of Hadoop 3 update 4 (CDH3u4) and Hadoop version 1.0.3.
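
A quick way to sanity-check the cluster before preparing any data (standard Hadoop 1.x/CDH3 commands; the examples jar name may differ in your distribution):

  # Confirm HDFS is up and the DataNodes are reporting in.
  hadoop dfsadmin -report
  # Run a tiny MapReduce job (2 maps, 10 samples each) to confirm the JobTracker works.
  hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 2 10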

2.0.3 Setup Hive (for hivebench)
Please make sure you have properly set up Hive in your cluster if you want to test hivebench. Otherwise, the benchmark will use the default Hive 0.9 release included in the package.

2.1 Configure all workloads

You need to set some global environment variables in the "configure.sh" file located in the root dir.
HIBENCH_HOME     <The HiBench installation location>
HADOOP_HOME      <The Hadoop installation location>
HIVE_HOME        <The Hive installation location>
MAHOUT_HOME      <The Mahout installation location>
HADOOP_CONF_DIR  <The Hadoop configuration dir; default is $HADOOP_HOME/conf>
COMPRESS_GLOBAL  <Whether to enable input/output compression for all workloads; 0 to disable, 1 to enable>
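
For example (the paths below are purely illustrative; point them at your own installations):

  export HIBENCH_HOME=/home/user/HiBench-All/HiBench-2.1
  export HADOOP_HOME=/usr/lib/hadoop
  export HIVE_HOME=/usr/lib/hive           # or the Hive 0.9 copy shipped with the suite
  export MAHOUT_HOME=/usr/lib/mahout       # a Mahout 0.6 installation
  export HADOOP_CONF_DIR=$HADOOP_HOME/conf
  export COMPRESS_GLOBAL=1                 # 1 = enable input/output compression, 0 = disable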

2.2 Configure each workload

If it exists, you can modify the "configure.sh" file under each workload folder. The data size and all options related to the workload are defined in this file.
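
As a purely hypothetical illustration (the actual variable names and defaults differ from workload to workload, so check the file shipped with each benchmark), a micro-benchmark configure.sh typically exposes settings along these lines:

  # Hypothetical per-workload settings; names and values are illustrative only.
  DATASIZE=2400000000      # amount of input data to generate
  NUM_MAPS=96              # map tasks used when preparing the input
  NUM_REDS=48              # reduce tasks used by the workload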

2.3 Synchronize the time on all nodes (this is required for dfsioe, and optional for the others)
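
One simple way to do this, assuming every node can reach an NTP server (the server below is only an example), is to run on each node:

  sudo ntpdate pool.ntp.org    # one-shot clock sync; any reachable NTP server works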

* 3 Running

3.0 Run several workloads together
The "benchmarks.lst" file under the package folder defines the workloads to run when you execute the "run-all.sh" script under the package folder. Each line in the list file specifies one workload. You can use "#" at the beginning of each line to skip the corresponding bench if necessary. 

3.1 Run each workload separately
You can also run each workload separately. In general, there are 3 different files under each workload folder.

configure.sh	Configuration file containing all parameters, such as data size and test options.
prepare*.sh	Generates or copies the job input data into HDFS.
run*.sh		Executes the workload.

Follow the steps below to run a workload:
a) Configure the benchmark: set your own configurations by modifying configure.sh if necessary
b) Prepare data: "bash ./prepare.sh" ("bash ./prepare-read.sh" for dfsioe) to prepare input data in HDFS for running the benchmark
c) Run the benchmark:    "bash ./run*.sh" to run the corresponding benchmark
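
Putting these steps together for the sort workload (the folder name is taken from the parenthesized name in Section 1, and $HIBENCH_HOME is assumed to point at the suite's root):

  cd $HIBENCH_HOME/sort
  # optionally edit configure.sh here to change the data size or options
  bash ./prepare.sh       # generate the random text input in HDFS
  bash ./run.sh           # run the Sort MapReduce job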