- Homepage: https://github.com/xinjinhan/BaBench.git
- Contents:
- Overview
- Getting Started
- Build Babench
- Configuration
- Initialize the Environment
- Prepare Benchmark Data
- Run benchmark
babench is a big data benchmark suite for evaluating big data frameworks such as Spark SQL, Hive, and Impala. It currently contains TPC-DS and TPC-H, two commonly used decision support benchmarks, and can be used to benchmark Spark, Hive, and similar frameworks. Support for more benchmarks, cloud-native systems, and service monitoring systems such as Prometheus is planned.
Before testing, make sure Hadoop and Spark are deployed. You can check with the following commands:
hadoop version
spark-shell --version
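The two checks above can be wrapped in a small pre-flight script. This is a minimal sketch and not part of babench: `missing_tools` is a hypothetical helper name, and it only confirms that the commands are on `PATH`, not that the cluster itself is healthy.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not shipped with babench).
# Prints the name of each given command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
  return 0
}

missing=$(missing_tools hadoop spark-shell)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line
  echo "Not found on PATH:" $missing
else
  echo "hadoop and spark-shell are both on PATH"
fi
```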
- Copy "slaves.template" to "slaves" in folder conf.
- Specify the hostname or IP address of every node, one per line. For example:
slave1
slave2
slave3
- Execute bin/InitializeEnvironment.sh
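The three steps above could be scripted roughly as follows. This is a sketch under assumptions: `write_slaves` is a hypothetical helper, it assumes the template can simply be copied and the hostnames appended to it, and `slave1` through `slave3` are the placeholder names from the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of babench): copy conf/slaves.template to
# conf/slaves, then append one hostname or IP address per line.
# Usage: write_slaves <conf-dir> <host> [<host> ...]
write_slaves() {
  conf_dir=$1
  shift
  cp "$conf_dir/slaves.template" "$conf_dir/slaves" || return 1
  for host in "$@"; do
    printf '%s\n' "$host" >> "$conf_dir/slaves"
  done
}

# Example, run from the babench root directory:
#   write_slaves conf slave1 slave2 slave3
#   bin/InitializeEnvironment.sh
```

If the real template already contains entries or comments, review the resulting `conf/slaves` file before running the initialization script.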
babench generates benchmark data with Spark, so make sure Spark is available in your cluster. The more resources allocated to Spark, the faster the data is generated. For details about Spark tuning, see the Spark Tuning Guides.
- Specify the configuration in bin/GenerateTpcdsData.sh:
dataScale ( the scale of the generated data, in GB )
onlyInitializeMetastore ( whether to skip data generation and create the tables directly; usually keep it False )
#!/bin/bash
# configurations
dataScale=500
onlyInitializeMetastore=False
- Execute bin/GenerateTpcdsData.sh on the master node.
- Specify the configuration in bin/GenerateTpchData.sh:
dataScale ( the scale of the generated data, in GB )
onlyInitializeMetastore ( whether to skip data generation and create the tables directly; usually keep it False )
#!/bin/bash
# configurations
dataScale=500
onlyInitializeMetastore=False
- Execute bin/GenerateTpchData.sh on the master node.
Currently, babench provides test scripts for Spark and Hive.
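Instead of editing the scripts by hand, the configuration values could be changed from the command line. The following is a minimal sketch, assuming each setting sits on its own line as `key=value` as in the snippets above; `set_conf` is a hypothetical helper and not something babench provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of babench): rewrite a `key=value` line in a
# script. Assumes the key starts at the beginning of a line, and that the
# value contains no `|` or `&` characters (they are special to sed here).
# Usage: set_conf <file> <key> <value>
set_conf() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp) || return 1
  sed "s|^$key=.*|$key=$value|" "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$file"   # overwrite in place so the file permissions are kept
  rm -f "$tmp"
}

# Example: generate 100 GB of TPC-H data instead of 500 GB.
#   set_conf bin/GenerateTpchData.sh dataScale 100
#   bin/GenerateTpchData.sh
```

Writing through a temporary file rather than using `sed -i` avoids the differences between GNU and BSD `sed`.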
- Specify the configuration in bin/TestSparkWithTpcds.sh or bin/TestHiveWithTpcds.sh:
dataScale ( the data scale of the TPC-DS benchmark, in GB )
selectedQueries ( which TPC-DS queries to run; all queries can be found in querySamples/tpcds )
#!/bin/bash
# configurations
dataScale=500
selectedQueries=q1,q2,q3
- Run bin/TestSparkWithTpcds.sh or bin/TestHiveWithTpcds.sh directly.
- Specify the configuration in bin/TestSparkWithTpch.sh or bin/TestHiveWithTpch.sh:
dataScale ( the data scale of the TPC-H benchmark, in GB )
selectedQueries ( which TPC-H queries to run; all queries can be found in querySamples/tpch )
#!/bin/bash
# configurations
dataScale=500
selectedQueries=q1,q2,q3
- Run bin/TestSparkWithTpch.sh or bin/TestHiveWithTpch.sh directly.
- babench saves the main results to babench.report, with the following columns:
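Typing a long `selectedQueries` value by hand is error-prone, so a small generator may help. This sketch assumes the queries are named `q1`, `q2`, and so on, as in the example above; check the files under querySamples for the actual names before relying on it. `query_list` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of babench): print "q1,q2,...,qN".
# ASSUMPTION: query names follow the q<number> pattern from the example.
# Usage: query_list <N>
query_list() {
  n=$1
  i=1
  out=""
  while [ "$i" -le "$n" ]; do
    out="${out:+$out,}q$i"   # add a comma only between entries
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Example: the TPC-H specification defines 22 queries.
#   query_list 22
```

The printed string can then be pasted into the `selectedQueries=` line of the test script.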
Framework BenchmarkName Queries Durations StartAt StopAt DurationSum Datasize FinalStatus
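For a quick look at the results, the report could be summarized with `awk`. This is only a sketch: it assumes babench.report is whitespace-delimited with the nine columns listed above and no spaces inside a field, which the format line does not confirm. `summarize_report` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of babench). ASSUMPTION: babench.report is
# whitespace-delimited with the nine columns listed above and no spaces
# inside a field; adjust the field numbers if the real layout differs.
# Prints Framework, BenchmarkName, DurationSum and FinalStatus for each run,
# skipping the header line.
summarize_report() {
  awk 'NF >= 9 && $1 != "Framework" { print $1, $2, $7, $9 }' "$1"
}

# Example:
#   summarize_report babench.report
```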