Primary language: Java. License: Apache License 2.0 (Apache-2.0).

Storm Benchmark

How we measure Storm performance

The benchmark set contains 10 workloads in two categories. The first category, "simple resource benchmarks", tests how Storm performs when a particular resource is under pressure. The second category measures how Storm performs in typical real-life use cases.

  • Simple resource benchmarks:

    • wordcount, CPU sensitive
    • sol, network sensitive
    • rollingsort, memory sensitive
  • Typical use-case benchmarks:

    • rollingcount
    • trident
    • uniquevisitor
    • pageview
    • grep
    • dataclean
    • drpc

In real-life use cases, Kafka is often used for data ingestion. To account for that, most use-case benchmarks read data from Kafka, and they can be grouped by the data generator that feeds them:

  • data generated by PageViewKafkaProducer

    • dataclean
    • drpc
    • pageview
    • uniquevisitor
  • data generated by FileReadKafkaProducer

    • grep
    • trident

The data generators are provided with the benchmarks and are themselves Storm applications.

How to use

We assume a Storm cluster is already set up locally.

  1. Build.

First, build storm-benchmark.

  git clone https://github.com/manuzhang/storm-benchmark.git
  cd storm-benchmark
  mvn package

  2. Run. We use SOL as an example.

  bin/stormbench -storm ${STORM_HOME}/bin/storm -jar ./target/storm-benchmark-${VERSION}-jar-with-dependencies.jar -conf ./conf/sol.yaml -c topology.workers=2 storm.benchmark.tools.Runner storm.benchmark.benchmarks.SOL

  • -storm tells stormbench where to find the storm command
  • -jar sets the benchmark jar, built with all its dependencies
  • -conf lets the user provide a YAML conf file like storm/conf/storm.yaml. Check the storm-benchmark/conf folder, where conf files are already provided for the existing benchmarks
  • -c lets the user set a conf value on the command line without modifying the conf files every time

  3. Check. The benchmark results are stored at the path given by the METRICS_PATH config (default: reports). They contain throughput and latency data for the whole cluster.

A SOL run produces two result files:

1. `SOL_metrics_1402148415021.csv`. Performance data.
2. `SOL_metrics_1402148415021.yaml`. The config used to run this test.
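
In the example above, the two files share a base name and differ only in extension, so the config used for a given CSV can be located mechanically. The following is a minimal sketch under that assumption; `metrics_suffix` and `config_for` are hypothetical helper names, not part of storm-benchmark, and the meaning of the numeric suffix is not stated here.

```shell
# Hypothetical helpers (not part of storm-benchmark).

# Print the numeric suffix of a result file name, e.g.
# reports/SOL_metrics_1402148415021.csv -> 1402148415021
metrics_suffix() {
  name=$(basename "$1")
  name=${name%.*}               # drop the extension
  printf '%s\n' "${name##*_}"   # keep what follows the last underscore
}

# Print the path of the .yaml config that sits next to a .csv result file,
# assuming both share the same base name as in the example above.
config_for() {
  printf '%s\n' "${1%.csv}.yaml"
}
```

For instance, `config_for reports/SOL_metrics_1402148415021.csv` prints `reports/SOL_metrics_1402148415021.yaml`.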

How to run a benchmark ingesting data from Kafka

We assume Storm and Kafka have been set up locally and that Storm Benchmark has been built successfully. There is no need to create the Kafka topic beforehand; it can be auto-created when the producer sends messages to Kafka.

Here's how we run uniquevisitor, for instance.

  1. Launch PageViewKafkaProducer.

  bin/stormbench -storm ${STORM_HOME}/bin/storm -jar ./target/storm-benchmark-${VERSION}-jar-with-dependencies.jar -conf ./conf/pageview_producer.yaml storm.benchmark.tools.Runner storm.benchmark.tools.producer.kafka.PageViewKafkaProducer

  2. Launch UniqueVisitor.

  bin/stormbench -storm ${STORM_HOME}/bin/storm -jar ./target/storm-benchmark-${VERSION}-jar-with-dependencies.jar -conf ./conf/uniquevisitor.yaml storm.benchmark.tools.Runner storm.benchmark.benchmarks.UniqueVisitor

Then check the metrics data as described in the previous section.
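
The stormbench invocations in this README differ only in the conf file, the class to run, and any extra `-c` overrides. As a convenience, they could be wrapped in a small shell function. This is only a sketch: `run_bench` and `DRY_RUN` are hypothetical names introduced here, not part of storm-benchmark, and the function assumes it is called from the storm-benchmark checkout with `STORM_HOME` and `VERSION` set.

```shell
# Hypothetical wrapper (not part of storm-benchmark).
# Usage: run_bench <conf file> <class> [extra stormbench options, e.g. -c key=value]
# Set DRY_RUN=1 to print the command instead of running it.
run_bench() {
  conf=$1
  class=$2
  shift 2   # whatever remains is passed through before the Runner class

  # Rebuild the positional parameters as the full stormbench command line.
  set -- bin/stormbench \
    -storm "${STORM_HOME}/bin/storm" \
    -jar "./target/storm-benchmark-${VERSION}-jar-with-dependencies.jar" \
    -conf "$conf" "$@" \
    storm.benchmark.tools.Runner "$class"

  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

With this in place, the uniquevisitor run becomes `run_bench ./conf/pageview_producer.yaml storm.benchmark.tools.producer.kafka.PageViewKafkaProducer` followed by `run_bench ./conf/uniquevisitor.yaml storm.benchmark.benchmarks.UniqueVisitor`.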

Support

Please contact:

Acknowledgement

We use the SOL benchmark code (https://github.com/yahoo/storm-perf-test) from Yahoo. Thanks.