Copyright © 2014-2015, Typesafe, All Rights Reserved.
Apache Spark and the Spark Logo are trademarks of The Apache Software Foundation.
These exercises are part of the Typesafe Apache Spark Workshop, which teaches you why Spark is important for modern data-centric computing, how to write Spark batch-mode, streaming, and SQL-based applications, and what your deployment options are. We'll also introduce Spark modules for graph algorithms and machine learning, and we'll see how Spark can work as part of a larger reactive application implemented with the Typesafe Reactive Platform (TRP).
This workshop uses Spark 1.4.0 and Scala 2.11. Even though the Spark builds found at the Apache site are only for Scala 2.10, there are 2.11 artifacts in the Apache Maven repositories.
However, the 2.11 builds are considered experimental at this time; consider using Scala 2.10 for production software.
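To illustrate how you would pull those 2.11 artifacts, here is a minimal `build.sbt` sketch. The workshop project ships with its own build definition, so treat the settings and versions below as illustrative assumptions, not the actual build:

```scala
// Hypothetical build.sbt sketch, not the workshop's actual build definition.
scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, resolving e.g. spark-core_2.11.
  "org.apache.spark" %% "spark-core"      % "1.4.0",
  "org.apache.spark" %% "spark-sql"       % "1.4.0",
  "org.apache.spark" %% "spark-streaming" % "1.4.0"
)
```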
The following documentation links provide more information about Spark:
The Documentation includes a getting-started guide and overviews. You'll find the Scaladoc API pages useful for the workshop.
You'll need to install `sbt`, which you can use for all the exercises, or to bootstrap Eclipse. Installing `sbt` isn't necessary if you use IntelliJ. See the sbt website for instructions on installing `sbt`.
You were given a download link to a zip file with the exercises. Unzip it in a convenient work directory, then pick from the following subsections depending on how you want to work with the exercises:
Open a terminal/console window and change to the working directory where you expanded the exercises. Run the `sbt` command, which puts you at the `sbt` prompt, then run the `test` "task", which downloads all dependencies, compiles the main code and the test code, and then runs the tests. It should finish with a success message. Here are the steps, where `$` is used as the "shell" or command prompt, `>` is the `sbt` prompt, and the `#...` are comments you shouldn't type in:
```
$ sbt
[info] ... # Information messages as sbt starts
> test
...
[success] Total time: ... # Successfully compiled and tested the code.
```
Stay in `sbt` for the rest of the workshop. You'll run all your commands from here. Open the exercises in your favorite text editor.
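Two `sbt` conveniences are worth knowing while you work (these are standard sbt features, not workshop-specific): `testOnly` runs a single test suite, and the `~` prefix re-runs a task every time you save a source file. The suite name below is hypothetical:

```
> testOnly *WordCountSpec   # run one suite; the suite name here is hypothetical
> ~test                     # re-run all tests on every source change
```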
Since Eclipse plugins are "temperamental", we recommend downloading a complete Eclipse distribution with the Scala plugin installed from http://scala-ide.org. Also, we have found that the Scala 2.11.X version of the IDE has problems with the workshop project, so download the IDE for 2.10.X. However, this site also has the plugin URLs if you prefer to try installing the plugin into an existing Eclipse first.
Unfortunately, a ScalaTest plugin is not included, and the "incubator" version hosted on the Scala-IDE update site, http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site, appears to be obsolete. Try the instructions on this scalatest.org page or use the `sbt` console to run the tests.
You'll need to generate Eclipse project files using `sbt`. (You can do this while Eclipse is doing its thing.) Open a terminal/console window and change to the working directory where you expanded the exercises. Run the following `sbt` command to generate the project files:
```
sbt eclipse
```
It will take a few minutes, as it has to first download all the project dependencies.
Once it completes, start Eclipse and use the File > Import menu option, then use the dialog to import the project you just created.
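The `eclipse` task comes from the sbteclipse plugin, which the workshop build should already include. If `sbt` ever reports the task as unknown, the plugin can be declared in `project/plugins.sbt`; the version below is illustrative, so check the sbteclipse releases for the current one:

```scala
// project/plugins.sbt -- only needed if the eclipse task is missing.
// The version number here is illustrative; check the sbteclipse releases.
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
```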
Make sure you have installed the Scala plugin; once it's in place, you can import the exercises as an SBT project.
There is also a README in the `data` directory. It describes the data we'll use for the exercises and where it came from.
To learn more, see the following:
- The Apache Spark website.
- The Apache Spark documentation.
- The Apache Spark Quick Start. See also the examples in the Spark distribution and be sure to study the Scaladoc pages for key types such as `SparkContext`, `SQLContext`, `RDD`, and `DataFrame`. A short sketch using these types follows this list.
- Talks from Spark Summit 2013, 2014, and 2015.
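Here is a hedged sketch of how those four types relate, written for Spark 1.4 with a local master; the object name, app name, and sample data are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: shows how SparkContext, SQLContext, RDD, and DataFrame relate.
object KeyTypesSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KeyTypesSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)        // entry point for RDDs
    val sqlContext = new SQLContext(sc)    // entry point for DataFrames (Spark 1.x)
    import sqlContext.implicits._          // enables rdd.toDF(...)

    val rdd = sc.parallelize(Seq(("spark", 1), ("scala", 2)))  // an RDD[(String, Int)]
    val df  = rdd.toDF("word", "count")                        // a DataFrame with named columns
    df.show()

    sc.stop()
  }
}
```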
Experience Reports:
Other Spark-based Libraries:
- Spark Notebook - An interactive, web-based environment similar to IPython.
- Zeppelin - Another notebook.
- Snowplow's Spark Example Project.
- Thunder - Large-scale neural data analysis with Spark.
- See Typesafe Reactive Big Data to find other Activator templates.
- See Typesafe Activator to find other Activator templates.
- See Typesafe for more information about our products and services.