
Cloud-Ready Configuration for Yahoo Streaming Benchmarks

Introduction

The Yahoo streaming benchmark is the first effort to create a benchmark for state-of-the-art Big Data streaming platforms. However, this repository currently provides only the Cloud configuration used in the experiments of the paper for Auto-DaSP 2017. If you use this repository, please cite the following paper:

@inproceedings{ScalabilityAndState,
  title = {Scalability and State:
           {A} Critical Assessment of Throughput Obtainable on
           Big Data Streaming Frameworks for
           Applications With and Without State Information},
  author = {Shinhyung Yang and Yonguk Jeong and ChangWan Hong and Hyunje Jun
            and Bernd Burgstaller},
  booktitle = {Euro-Par 2017: Parallel Processing Workshops -
               Euro-Par 2017 International Workshops, Santiago de Compostela, Spain,
               August 28-29, 2017, Revised Selected Papers},
  series = "Lecture Notes in Computer Science",
  volume = "10659",
  pages = "141--152",
  year = "2017",
  doi = "https://doi.org/10.1007/978-3-319-75178-8\_12",
}

Prerequisites

This configuration uses the revision of Yahoo's official GitHub repository committed on Nov 23 2016. It requires 30 nodes on a Cloud computing platform; I recommend Google Compute Engine. In my Cloud setup, each node is configured with 16 vCPUs and 24 GB of main memory. Each node has two properties: Instance_Name and Internal_IP. Instance_Name is specific to Google Compute Engine and serves as an alias for the node. You may want to create the Cloud nodes manually so that you can assign the internal IPs yourself. Otherwise, all IP addresses in the provided patch files need to be replaced according to your setup; in that case, refer to the table in the Applying Patches section.
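If your nodes live in a different subnet, replacing the IP addresses in the patch files can be scripted. The following is a minimal sketch, assuming your nodes keep the last octets from the table (101 to 130) and differ only in the first three octets. The helper name `rewrite_patch_ips` and the example prefix `10.128.0` are hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper: copy a patch file while replacing the 10.140.0.x
# prefix used in this repository with the prefix of your own subnet.
# Assumption: the last octet of every node stays the same as in the table.
rewrite_patch_ips() {
    src=$1         # original patch file, e.g. kafka-1.patch
    dst=$2         # rewritten copy to create
    new_prefix=$3  # first three octets of your subnet, e.g. 10.128.0
    # Dots are escaped so they match literally; only hosts 100..139 in the
    # 10.140.0.x range are rewritten, every other address is left alone.
    sed "s/10\.140\.0\.\(1[0-3][0-9]\)/${new_prefix}.\1/g" "$src" > "$dst"
}
```

For example, `rewrite_patch_ips kafka-1.patch kafka-1.local.patch 10.128.0` writes a copy in which `10.140.0.105` becomes `10.128.0.105`. Review the result with `diff` before applying it, since a setup with different host numbers needs a per-node mapping instead.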

Preparing the Yahoo streaming benchmark

Clone the benchmark from the official repository:

git clone https://github.com/yahoo/streaming-benchmarks.git

Reset to the specific revision from Nov 23 2016:

cd streaming-benchmarks
git reset --hard b073202b04baa640840a09b206c101996c112b95
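Since the patches were made against this exact revision, it can be worth confirming the checkout before continuing. A minimal sketch follows; the helper name `at_revision` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the Git checkout in directory $1
# has commit $2 as its HEAD.
at_revision() {
    dir=$1
    expected=$2
    # rev-parse fails if $dir is not a Git repository; treat that as a mismatch.
    actual=$(git -C "$dir" rev-parse HEAD 2>/dev/null) || return 1
    [ "$actual" = "$expected" ]
}
```

A possible call is `at_revision streaming-benchmarks b073202b04baa640840a09b206c101996c112b95 && echo "revision OK"`.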

Run the initial download and setup of the benchmark:

./stream-bench.sh SETUP

The setup aborts when a package download fails. If you encounter this problem, download the affected packages directly from archive.apache.org, as in the example below:

cd download-cache
rm flink-1.1.3-bin-hadoop27-scala_2.10.tgz spark-1.6.2-bin-hadoop2.6.tgz
wget http://archive.apache.org/dist/flink/flink-1.1.3/flink-1.1.3-bin-hadoop27-scala_2.10.tgz
wget http://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
# Continue with setup
cd ..
./stream-bench.sh SETUP
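To find out which packages need to be fetched again, the archives in download-cache can be checked before rerunning the setup. This is a minimal sketch, assuming a failed download leaves an unreadable or truncated `.tgz` file behind; the helper name `find_broken_archives` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every .tgz file in directory $1 that tar cannot
# list, which usually indicates a truncated or otherwise failed download.
find_broken_archives() {
    for f in "$1"/*.tgz; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Listing the archive reads it completely without extracting it.
        tar -tzf "$f" >/dev/null 2>&1 || echo "$f"
    done
}
```

Running `find_broken_archives download-cache` from the streaming-benchmarks directory prints one path per broken archive and prints nothing if all archives are readable.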

Applying Patches

On each Cloud node, apply the patch file listed for that node in the table below to its Yahoo streaming benchmark folder:

Instance_Name Patch_Name Internal_IP
streaming-group-0-0001 zk.patch 10.140.0.101
streaming-group-0-0002 zk.patch 10.140.0.102
streaming-group-0-0003 zk.patch 10.140.0.103
streaming-group-0-0004 redis.patch 10.140.0.104
streaming-group-0-0005 kafka-1.patch 10.140.0.105
streaming-group-0-0006 kafka-2.patch 10.140.0.106
streaming-group-0-0007 kafka-3.patch 10.140.0.107
streaming-group-0-0008 kafka-4.patch 10.140.0.108
streaming-group-0-0009 kafka-5.patch 10.140.0.109
streaming-group-0-0010 master.patch 10.140.0.110
streaming-group-0-0011 slave.patch 10.140.0.111
streaming-group-0-0012 slave.patch 10.140.0.112
streaming-group-0-0013 slave.patch 10.140.0.113
streaming-group-0-0014 slave.patch 10.140.0.114
streaming-group-0-0015 slave.patch 10.140.0.115
streaming-group-0-0016 slave.patch 10.140.0.116
streaming-group-0-0017 slave.patch 10.140.0.117
streaming-group-0-0018 slave.patch 10.140.0.118
streaming-group-0-0019 slave.patch 10.140.0.119
streaming-group-0-0020 slave.patch 10.140.0.120
streaming-group-0-0021 dg.patch 10.140.0.121
streaming-group-0-0022 dg.patch 10.140.0.122
streaming-group-0-0023 dg.patch 10.140.0.123
streaming-group-0-0024 dg.patch 10.140.0.124
streaming-group-0-0025 dg.patch 10.140.0.125
streaming-group-0-0026 dg.patch 10.140.0.126
streaming-group-0-0027 dg.patch 10.140.0.127
streaming-group-0-0028 dg.patch 10.140.0.128
streaming-group-0-0029 dg.patch 10.140.0.129
streaming-group-0-0030 dg.patch 10.140.0.130

Once the patch files are ready, change to the parent of your Yahoo streaming benchmark directory and enter the command below:

cd /path/to/your/streaming-benchmarks/..
patch -p0 --ignore-whitespace < /path/to/your/Patch_Name
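When the same steps are repeated on 30 nodes, the lookup in the table can be expressed as a small function. This is a minimal sketch that maps the numeric suffix of Instance_Name to the patch name from the table; the helper name `patch_for_node` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the patch name for node number $1 (1..30),
# following the Instance_Name / Patch_Name table of this README.
patch_for_node() {
    n=$1
    if   [ "$n" -ge 1 ]  && [ "$n" -le 3 ];  then echo "zk.patch"
    elif [ "$n" -eq 4 ];                     then echo "redis.patch"
    elif [ "$n" -ge 5 ]  && [ "$n" -le 9 ];  then echo "kafka-$((n - 4)).patch"
    elif [ "$n" -eq 10 ];                    then echo "master.patch"
    elif [ "$n" -ge 11 ] && [ "$n" -le 20 ]; then echo "slave.patch"
    elif [ "$n" -ge 21 ] && [ "$n" -le 30 ]; then echo "dg.patch"
    else
        echo "unknown node number: $n" >&2
        return 1
    fi
}
```

For example, `patch_for_node 7` prints `kafka-3.patch`. GNU patch also accepts `--dry-run`, which reports whether a patch would apply without modifying any file, so a first pass with that option is a cheap way to catch a wrong working directory.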

Preparing Cloud Nodes

To run the Kafka cluster, each Kafka broker node needs a log directory. Refer to the log.dirs attribute of the Kafka configuration file on each of your Kafka broker nodes, located in streaming-benchmarks/kafka_2.10-0.8.2.1/config.
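Creating the log directory can be automated by reading it from the broker configuration. The following is a minimal sketch, assuming the directory is set by the `log.dirs` key, which is the key Kafka brokers read from `config/server.properties`; the helper names `kafka_log_dirs` and `make_kafka_log_dirs` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the directories listed under log.dirs in the
# Kafka properties file $1, one per line (the value is comma-separated).
kafka_log_dirs() {
    sed -n 's/^log\.dirs[[:space:]]*=[[:space:]]*//p' "$1" | tr ',' '\n'
}

# Hypothetical helper: create every directory listed under log.dirs.
make_kafka_log_dirs() {
    kafka_log_dirs "$1" | while IFS= read -r d; do
        if [ -n "$d" ]; then
            mkdir -p "$d"
        fi
    done
}
```

On a broker node, a possible call is `make_kafka_log_dirs streaming-benchmarks/kafka_2.10-0.8.2.1/config/server.properties`. The account that runs Kafka needs write permission on the resulting directories.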

Additional Installation

Because I ran into problems when using Storm's internal dev-zookeeper to set up the ZooKeeper cluster, I added a standalone ZooKeeper of the same version. Follow the instructions below to add it to the framework:

cd streaming-benchmarks
wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
tar xf zookeeper-3.4.6.tar.gz
rm zookeeper-3.4.6.tar.gz

To execute the benchmark's data-generator Clojure script, download and set up lein as follows:

mkdir -p ~/bin
cd ~/bin
wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
chmod +x ./lein

Then, modify line 19 of your streaming-benchmarks/stream-bench.sh so that it points to your copy of lein:

LEIN="/home/your_account_name/bin/lein"
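The same change can be made without an editor. This is a minimal sketch, assuming the assignment starts with `LEIN=` at the beginning of a line in stream-bench.sh; it matches the line by name instead of by number, and the helper name `set_lein_path` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the LEIN= assignment in script $1 so that it
# points at the lein executable $2. All other lines are left unchanged.
set_lein_path() {
    script=$1
    lein=$2
    tmp=$(mktemp) || return 1
    # '|' is the sed delimiter because the path contains slashes.
    # Writing back with cat keeps the script's original file permissions.
    sed "s|^LEIN=.*|LEIN=\"$lein\"|" "$script" > "$tmp" && cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return $status
}
```

A possible call is `set_lein_path streaming-benchmarks/stream-bench.sh "$HOME/bin/lein"`. The sketch does not handle paths that contain `|` or `&`.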

Contact

Please open a new issue to contact the author.

License

Code licensed under the Apache 2.0 license. See LICENSE file for terms.

Acknowledgements

This work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning under grant NRF-2015-M3C4A-7065522.