spark-demo

A simple project intended to demo Spark and get developers up and running quickly.

Note: This project uses Gradle. You must install Gradle (1.12). If you would rather not install Gradle locally, you can use the Gradle Wrapper by replacing all references to gradle with gradlew.

How To Build:

  1. Execute gradle build
  2. Find the artifact jars in './build/libs/'

IntelliJ Project Setup:

  1. Execute gradle idea
  2. Open the project folder in IntelliJ or open the generated .ipr file

Note: If you have any issues in IntelliJ, a good first troubleshooting step is to execute gradle cleanIdea idea

Eclipse Project Setup:

  1. Execute gradle eclipse
  2. Open the project folder in Eclipse

Note: If you have any issues in Eclipse, a good first troubleshooting step is to execute gradle cleanEclipse eclipse

Key Spark Links:

Using The Project:

Note: This guide has only been tested on Mac OS X and may assume tools specific to it. If you are working on another OS, you may need to substitute equivalent tools, which should be readily available.

Step 1 - Build the Project:

  1. Run gradle build

Step 2 - Run the Demos in Local mode:

The demos generally take the first argument as the Spark Master URL. Setting this value to 'local' runs the demo in local mode. The trailing number in brackets '[#]' indicates the number of cores to use (e.g., 'local[2]' runs locally with 2 cores).
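
For reference, here is a minimal sketch of how a demo might wire that first argument into a SparkContext, assuming the Spark 1.0 Scala API (the object and app names are illustrative, not actual classes in this project):

  import org.apache.spark.{SparkConf, SparkContext}

  object LocalModeSketch {
    def main(args: Array[String]): Unit = {
      // args(0) is the Spark Master URL, e.g. "local[2]"
      val conf = new SparkConf().setMaster(args(0)).setAppName("LocalModeSketch")
      val sc = new SparkContext(conf)
      try {
        // Trivial job to prove the context works
        println(sc.parallelize(1 to 10).reduce(_ + _))
      } finally {
        sc.stop()
      }
    }
  }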

This project has a Gradle task called 'runSpark' that manages the runtime classpath for you. This simplifies running spark jobs, ensures the same classpath is used in all modes, and shortens the development feedback loop.

The 'runSpark' Gradle task takes two arguments '-PsparkMain' and '-PsparkArgs':

  • -PsparkMain: The main class to run.
  • -PsparkArgs: The arguments to pass to the main class. See each demo class's documentation for the expected arguments.

Below are some sample commands for some simple demos:

  • SparkPi: gradle runSpark -PsparkMain="com.cloudera.sa.SparkPi" -PskipHadoopJar -PsparkArgs="local[2] 100"
  • Sessionize: gradle runSpark -PsparkMain="com.cloudera.sa.Sessionize" -PskipHadoopJar -PsparkArgs="local[2]"
  • HdfsWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.HdfsWordCount" -PskipHadoopJar -PsparkArgs="local[2] streaming-input"
  • NetworkWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.NetworkWordCount" -PskipHadoopJar -PsparkArgs="local[2] localhost 9999"
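
For context, here is a minimal sketch of what a streaming demo like NetworkWordCount typically does with those arguments, assuming the Spark Streaming 1.0 Scala API (the actual demo class in this project may differ):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  object NetworkWordCountSketch {
    def main(args: Array[String]): Unit = {
      // args: <spark-master-url> <hostname> <port>, e.g. "local[2] localhost 9999"
      val conf = new SparkConf().setMaster(args(0)).setAppName("NetworkWordCountSketch")
      val ssc = new StreamingContext(conf, Seconds(1))

      // Count words arriving on the socket in 1-second batches
      val lines = ssc.socketTextStream(args(1), args(2).toInt)
      lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

      ssc.start()
      ssc.awaitTermination()
    }
  }

To feed NetworkWordCount some input, you can open a socket with netcat in another terminal: nc -lk 9999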

Note: The remaining steps are only required for running demos in "pseudo-distributed" mode and on a cluster.

Step 3 - Install Spark:

  1. Install Spark 1.0 using Homebrew: brew install apache-spark
  2. Add SPARK_HOME to your .bash_profile: export SPARK_HOME=/usr/local/Cellar/apache-spark/1.0.0/libexec
  3. Add SCALA_HOME and JAVA_HOME to your .bash_profile

Note: You may also install Spark on your own by following the Spark documentation

Step 4 - Configure & Start Spark:

  1. The defaults should work for now. However, see the Cluster Launch Scripts documentation for more information on configuring your pseudo-cluster.
  2. Start your Spark cluster: $SPARK_HOME/sbin/start-all.sh
  3. Verify that the master and worker are running in the Spark Master WebUI
  4. Note the master URL in the Spark Master WebUI. It will be used when submitting jobs.
  5. Shut down when done: $SPARK_HOME/sbin/stop-all.sh

Step 5 - Run the Demos in Pseudo-Distributed mode:

Running in pseudo-distributed mode is almost exactly the same as running in local mode. Note: Please see Step 2 before continuing.

To run in pseudo-distributed mode, just replace 'local[#]' in the Spark Master URL argument with the URL from Step 4.

Below are some sample commands for each demo:

Note: You will need to substitute your own Spark Master URL

  • SparkPi: gradle runSpark -PsparkMain="com.cloudera.sa.SparkPi" -PsparkArgs="spark://example:7077 100"
  • Sessionize: gradle runSpark -PsparkMain="com.cloudera.sa.Sessionize" -PsparkArgs="spark://example:7077"
  • HdfsWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.HdfsWordCount" -PsparkArgs="spark://example:7077 streaming-input"
  • NetworkWordCount: gradle runSpark -PsparkMain="com.cloudera.sa.NetworkWordCount" -PsparkArgs="spark://example:7077 localhost 9999"

Step 6 - Run the Demos on a cluster:

The build creates a fat jar tagged with '-hadoop' that contains all dependencies needed to run on the cluster. The jar can be found in './build/libs/'.

TODO: Test this and fill out steps.

Step 7 - Develop your own Demos:

Develop demos of your own and send a pull request!
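
A new demo can follow the same conventions as the existing ones. Below is a hypothetical skeleton (the class name and job body are placeholders), assuming the project's convention that args(0) is the Spark Master URL and the Spark 1.0 Scala API:

  package com.cloudera.sa

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  // Hypothetical skeleton: args(0) is the Spark Master URL, as in the other demos
  object MyDemo {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setMaster(args(0)).setAppName("MyDemo")
      val sc = new SparkContext(conf)
      try {
        // Replace this toy word count with your demo's logic
        val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _)
        counts.collect().foreach(println)
      } finally {
        sc.stop()
      }
    }
  }

It can then be run the same way as the other demos: gradle runSpark -PsparkMain="com.cloudera.sa.MyDemo" -PsparkArgs="local[2]"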

Notable Tools & Frameworks:

Todo List:

  • Create a trait/class with a generic context, smart defaults, and unified arg parsing (see the spark-submit script for reference; a rough sketch follows this list)
  • Document what's demonstrated in each demo (Avro, Parquet, Kryo, etc.) and its usage
  • Add module level readme and docs
  • Tune logging output configuration (Redirect verbose logs into a rolling file)
  • Speed up HadoopJar task (and runSpark will follow)
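
For the first item above, one possible shape for that trait, sketched under the assumption that demos keep taking the Spark Master URL as args(0) (nothing below exists in the project yet):

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical trait: centralizes context creation and arg parsing so each
  // demo only implements run()
  trait SparkDemo {
    def appName: String = getClass.getSimpleName.stripSuffix("$")

    def run(sc: SparkContext, args: Array[String]): Unit

    def main(args: Array[String]): Unit = {
      require(args.nonEmpty, s"Usage: $appName <spark-master-url> [demo args...]")
      val conf = new SparkConf().setMaster(args(0)).setAppName(appName)
      val sc = new SparkContext(conf)
      try run(sc, args.drop(1)) finally sc.stop()
    }
  }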

Demos We're Working On: