Primary language: Scala. License: Apache License 2.0.

Big Data Benchmark Suite (BaBench)

A scalable, easy-to-use OLAP benchmark suite.


OVERVIEW

babench is a big data benchmark suite for evaluating big data frameworks such as Spark SQL, Hive, and Impala. It currently includes TPC-DS and TPC-H, two commonly used decision support benchmarks, and can be used to benchmark frameworks such as Spark and Hive. Support for more benchmarks, for cloud-native systems, and for service monitoring systems such as Prometheus is planned.


Getting Started

Before testing, make sure that Hadoop and Spark are deployed in your environment. You can check with the following commands:

hadoop version
spark-shell --version
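
If you prefer to script this check, the following is a minimal sketch. The helper names `require_cmd` and `check_prereqs` are hypothetical and not part of babench; the sketch only verifies that the commands are on the PATH, not that the cluster is healthy.

```shell
#!/bin/bash
# Hypothetical preflight helpers (not shipped with babench).

# Succeeds when the named command is found on PATH, fails otherwise.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing required command: $1" >&2
  return 1
}

# Checks every command given as an argument and reports all missing ones
# before failing, instead of stopping at the first problem.
check_prereqs() {
  local status=0 cmd
  for cmd in "$@"; do
    if ! require_cmd "$cmd"; then
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_prereqs hadoop spark-shell` returns a non-zero status and names the missing tool if either command is unavailable.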

1. Build babench

2. Configure slaves

  • Copy "slaves.template" to "slaves" in the conf folder.
  • Specify the hostname or IP address of every node, one per line. For example:
slave1
slave2
slave3
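
A quick way to sanity-check the slaves file is to list the entries that would be read from it. The sketch below assumes the one-entry-per-line format described above. The function name `list_slaves` is hypothetical, and the handling of blank lines and `#` comments is an assumption, since this document does not say how babench itself parses the file.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with babench): prints one host per line.
list_slaves() {
  local file="$1" line
  if [ ! -r "$file" ]; then
    echo "cannot read slaves file: $file" >&2
    return 1
  fi
  # The "|| [ -n "$line" ]" part keeps a last line that has no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    line="${line%%#*}"                                  # assumed: '#' starts a comment
    line="$(printf '%s' "$line" | tr -d '[:space:]')"   # remove all whitespace
    if [ -n "$line" ]; then
      printf '%s\n' "$line"
    fi
  done < "$file"
  return 0
}
```

Running `list_slaves conf/slaves | wc -l` then gives the number of nodes the file declares.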

3. Initialize the Environment

4. Prepare benchmark Data

babench uses Spark to generate benchmark data, so make sure Spark is available in your cluster. The more resources you allocate to Spark, the faster the data is generated. For details about Spark tuning, see the Spark tuning guides.

1) Generate TPC-DS Data

  • Specify the configuration in bin/GenerateTpcdsData.sh:

    dataScale ( the scale of the generated data, in GB )

    onlyInitializeMetastore ( whether to skip data generation and create the tables directly; usually keep it False )

#!/bin/bash

# configurations
dataScale=500
onlyInitializeMetastore=False
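
Because a mistyped value can waste a long generation run, it may help to validate the two settings before starting. The sketch below is an illustration only: `validate_gen_config` is a hypothetical helper, and it assumes that dataScale must be a positive integer and that onlyInitializeMetastore accepts exactly True or False. The script itself may accept other values.

```shell
#!/bin/bash
# Hypothetical validator (not shipped with babench).
# Usage: validate_gen_config <dataScale> <onlyInitializeMetastore>
validate_gen_config() {
  local scale="$1" only_init="$2"

  # Assumed rule: dataScale is a whole number of GB, made of digits only.
  case "$scale" in
    ''|*[!0-9]*)
      echo "dataScale must be a positive integer (GB), got: '$scale'" >&2
      return 1
      ;;
  esac
  if [ "$scale" -le 0 ]; then
    echo "dataScale must be greater than zero" >&2
    return 1
  fi

  # Assumed rule: only the two spellings used in this README are valid.
  case "$only_init" in
    True|False) ;;
    *)
      echo "onlyInitializeMetastore must be True or False, got: '$only_init'" >&2
      return 1
      ;;
  esac
  return 0
}
```

With the values shown above, `validate_gen_config 500 False` succeeds, while a value such as `500GB` is rejected.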

2) Generate TPC-H Data

  • Specify the configuration in bin/GenerateTpchData.sh:

    dataScale ( the scale of the generated data, in GB )

    onlyInitializeMetastore ( whether to skip data generation and create the tables directly; usually keep it False )

#!/bin/bash

# configurations
dataScale=500
onlyInitializeMetastore=False

5. Start Benchmarking

Currently, babench provides test scripts for Spark and Hive.

1) Run TPC-DS Benchmark

#!/bin/bash

# configurations
dataScale=500
selectedQueries=q1,q2,q3
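
The selectedQueries value is a comma-separated list of query names. As a rough illustration of how such a list can be split and checked for empty entries (for example, a stray double comma), here is a small sketch. The function name `split_queries` is hypothetical, and this is not necessarily how babench parses the setting.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with babench): prints one query name per
# line and fails if the list is empty or contains an empty entry.
split_queries() {
  local rest="$1," q
  while [ -n "$rest" ]; do
    q="${rest%%,*}"      # text before the first comma
    rest="${rest#*,}"    # text after the first comma
    if [ -z "$q" ]; then
      echo "empty query name in list: '$1'" >&2
      return 1
    fi
    printf '%s\n' "$q"
  done
  return 0
}
```

For the configuration above, `split_queries "q1,q2,q3"` prints q1, q2, and q3 on separate lines.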

2) Run TPC-H Benchmark

#!/bin/bash

# configurations
dataScale=500
selectedQueries=q1,q2,q3

6. Benchmark Results

Framework    BenchmarkName     Queries     Durations    StartAt     StopAt    DurationSum      Datasize     FinalStatus