Visit the docs website: https://sparktc.github.io/spark-bench/
- Current VS. Legacy Version
- Current Spark version supported by spark-bench: 2.1.1
- Documentation
- Installation
- Building It Yourself
- Running the Examples From The Distribution
- Previewing the Github Pages Site Locally
spark-bench has recently gone through an extensive rewrite. While we think you'll like the new capabilities, it does not yet have feature parity with the previous version of spark-bench. Many of the workloads that were available in the legacy version have not yet been ported over, but they will be!
In the meantime, if you would like to see the old version of spark-bench, it's preserved in the legacy branch.
You can also grab the last official release of the legacy version from here.
- Grab the latest release from here: https://github.com/ecurtin/spark-bench/releases/latest.
- Unpack the tarball using tar -xvzf, then cd into the newly created folder.
- Modify SPARK_HOME and SPARK_MASTER_HOST in bin/spark-bench-env.sh to reflect your environment.
- Start using spark-bench!
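The unpacking step can be sketched as a small shell helper. This is only an illustration: the function name unpack_spark_bench is hypothetical and not part of spark-bench, the tarball path is whatever file you downloaded from the releases page, and the sketch assumes the archive contains a single top-level folder.

```shell
# Hypothetical helper (not part of spark-bench): unpack a release tarball
# and print the name of the top-level folder it created.
unpack_spark_bench() {
  tarball="$1"      # path to the downloaded release tarball
  dest="${2:-.}"    # where to unpack; defaults to the current directory

  [ -f "$tarball" ] || { echo "no such file: $tarball" >&2; return 1; }
  mkdir -p "$dest" || return 1
  tar -xzf "$tarball" -C "$dest" || return 1

  # Assumes a single top-level folder: take the first path component
  # of the first entry in the archive listing.
  tar -tzf "$tarball" | head -n 1 | cut -d/ -f1
}
```

You would then cd into the printed folder and edit bin/spark-bench-env.sh as described in the steps above.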
Alternatively, you can also clone this repo and build it yourself.
First, install SBT according to the instructions for your system: http://www.scala-sbt.org/0.13/docs/Setup.html
Clone this repo.
git clone https://github.com/ecurtin/spark-bench.git
cd spark-bench/
The latest changes will always be on develop; the stable version is on master. Optionally check out develop here, or skip this step to stay on master.
git checkout develop
Building spark-bench takes more heap space than the default provided by SBT. There are several ways to set these options for SBT; this is just one. I recommend adding the following line to your bash_profile:
export SBT_OPTS="-Xmx1536M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M"
Now you're ready to test spark-bench, if you so desire.
sbt test
Finally, build the distribution folder and associated tar file.
sbt dist
The spark-bench distribution comes bundled with example scripts and configuration files that should run out of the box with only very limited setup.
If you installed spark-bench by unpacking the tar file, you're ready to go. If you cloned the repo, first run sbt dist and then change into the generated folder.
Inside the bin folder is a file called spark-bench-env.sh. In this file are two environment variables that you will be required to set. The first is SPARK_HOME, which is simply the full path to the top level of your Spark installation on your laptop or cluster. The second is SPARK_MASTER_HOST, which is the same as what you would enter as --master in a spark-submit script for this environment. This might be local[2] on your laptop, yarn on a Yarn cluster, or an IP address and port if you're running in standalone mode. You get the idea!
You can set those environment variables in your bash profile or by uncommenting the lines in spark-bench-env.sh and filling them out in place.
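As a rough illustration of the two settings, the sketch below exports example values and then checks that both are non-empty before anything is launched. The path /opt/spark and the helper name check_spark_bench_env are assumptions made for this example; neither comes from spark-bench itself.

```shell
# Example values only: replace both with settings for your own environment.
export SPARK_HOME="/opt/spark"         # assumed install path (hypothetical)
export SPARK_MASTER_HOST="local[2]"    # same value you would pass to --master

# Hypothetical helper (not part of spark-bench): report any unset variable.
check_spark_bench_env() {
  missing=0
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    missing=1
  fi
  if [ -z "${SPARK_MASTER_HOST:-}" ]; then
    echo "SPARK_MASTER_HOST is not set" >&2
    missing=1
  fi
  return "$missing"
}

check_spark_bench_env && echo "environment looks ready"
```

The quotes around local[2] matter: without them the shell may treat the brackets as a glob pattern.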
From the spark-bench distribution folder, simply run:
./examples/multi-submit-sparkpi/multi-submit-example.sh
The example scripts and associated configuration files are a great starting point for learning spark-bench by example. The kmeans example demonstrates the spark-bench CLI, while the multi-submit example shows more thorough usage of a configuration file.
The spark-bench documentation at https://sparktc.github.io/spark-bench/ is generated from files in the docs/ folder.
To see the Jekyll site locally:
- Follow the instructions from Github regarding installing Ruby, bundler, etc.
- From the docs/ folder, run bundle exec jekyll serve and navigate in your browser to 127.0.0.1:4000