Copyright © 2014-2015, Typesafe, All Rights Reserved.
Apache Spark and the Spark Logo are trademarks of The Apache Software Foundation.
These exercises are part of the Typesafe Apache Spark Workshop, which teaches you why Spark is important for modern data-centric computing, how to write Spark batch-mode, streaming, and SQL-based applications, and what your deployment options are. We'll also introduce Spark modules for graph algorithms and machine learning, and we'll see how Spark can work as part of a larger reactive application implemented with the Typesafe Reactive Platform (TRP).
This workshop uses Spark 1.4.0 and Scala 2.11. Even though the Spark builds found at the Apache site are only for Scala 2.10, there are 2.11 artifacts in the Apache Maven repositories.
However, the 2.11 builds are considered experimental at this time; consider using Scala 2.10 for production software.
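To illustrate how you would pull those 2.11 artifacts, here is a minimal `build.sbt` sketch. The workshop project ships with its own build definition, so treat the settings and versions below as illustrative assumptions, not the actual build:

```scala
// Hypothetical build.sbt sketch, not the workshop's actual build definition.
scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, resolving e.g. spark-core_2.11.
  "org.apache.spark" %% "spark-core"      % "1.4.0",
  "org.apache.spark" %% "spark-sql"       % "1.4.0",
  "org.apache.spark" %% "spark-streaming" % "1.4.0"
)
```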
The following documentation links provide more information about Spark:
The Documentation includes a getting-started guide and overviews. You'll find the Scaladoc API pages useful for the workshop.
You'll need to install `sbt`, which you can use for all the exercises, or to bootstrap Eclipse. Installing `sbt` isn't necessary if you use IntelliJ. See the sbt website for instructions on installing `sbt`.
You were given a download link to a zip file with the exercises. Unzip it in a convenient work directory, then pick from the following subsections depending on how you want to work with the exercises:
Open a terminal/console window and change to the working directory where you expanded the exercises. Run the `sbt` command, which puts you at the `sbt` prompt, then run the `test` "task", which downloads all dependencies, compiles the main code and the test code, and then runs the tests. It should finish with a success message. Here are the steps, where `$` is used as the "shell" or command prompt, `>` is the `sbt` prompt, and the `#...` are comments you shouldn't type in:
```
$ sbt
[info] ... # Information messages as sbt starts
> test
...
[success] Total time: ... # Successfully compiled and tested the code.
```
Stay in `sbt` for the rest of the workshop. You'll run all your commands from here. Open the exercises in your favorite text editor.
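Two `sbt` conveniences are worth knowing while you work (these are standard sbt features, not workshop-specific): `testOnly` runs a single test suite, and the `~` prefix re-runs a task every time you save a source file. The suite name below is hypothetical:

```
> testOnly *WordCountSpec   # run one suite; the suite name here is hypothetical
> ~test                     # re-run all tests on every source change
```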
Since Eclipse plugins are "temperamental", we recommend downloading a complete Eclipse distribution with the Scala plugin installed from http://scala-ide.org. Also, we have found that the Scala 2.11.X version of the IDE has problems with the workshop project, so download the IDE for 2.10.X. However, this site also has the plugin URLs if you prefer to try installing the plugin into an existing Eclipse first.
Unfortunately, a ScalaTest plugin is not included, and the "incubator" version hosted on the Scala-IDE update site, http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site, appears to be obsolete. Try the instructions on this scalatest.org page or use the `sbt` console to run the tests.
You'll need to generate Eclipse project files using `sbt`. (You can do this while Eclipse is doing its thing.) Open a terminal/console window and change to the working directory where you expanded the exercises. Run the following `sbt` command to generate the project files:
```
sbt eclipse
```
It will take a few minutes, as it has to first download all the project dependencies.
Once it completes, start Eclipse and use the File > Import menu option, then use the dialog to import the project you just created.
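The `eclipse` task comes from the sbteclipse plugin, which the workshop build should already include. If `sbt` ever reports the task as unknown, the plugin can be declared in `project/plugins.sbt`; the version below is illustrative, so check the sbteclipse releases for the current one:

```scala
// project/plugins.sbt -- only needed if the eclipse task is missing.
// The version number here is illustrative; check the sbteclipse releases.
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
```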
Make sure you have installed the Scala plugin; once it's in place, you can import the exercises as an SBT project.
There is also a README in the `data` directory. It describes the data we'll use for the exercises and where it came from.
To learn more, see the following:
- The Apache Spark website.
- The Apache Spark documentation.
- The Apache Spark Quick Start. See also the examples in the Spark distribution and be sure to study the Scaladoc pages for key types such as `SparkContext`, `SQLContext`, `RDD`, and `DataFrame`. A short sketch using these types follows this list.
- Talks from Spark Summit 2013, 2014, and 2015.
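Here is a hedged sketch of how those four types relate, written for Spark 1.4 with a local master; the object name, app name, and sample data are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: shows how SparkContext, SQLContext, RDD, and DataFrame relate.
object KeyTypesSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KeyTypesSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)        // entry point for RDDs
    val sqlContext = new SQLContext(sc)    // entry point for DataFrames (Spark 1.x)
    import sqlContext.implicits._          // enables rdd.toDF(...)

    val rdd = sc.parallelize(Seq(("spark", 1), ("scala", 2)))  // an RDD[(String, Int)]
    val df  = rdd.toDF("word", "count")                        // a DataFrame with named columns
    df.show()

    sc.stop()
  }
}
```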
Experience Reports:
Other Spark-based Libraries:
- Spark Notebook - An interactive, web-based environment similar to IPython.
- Zeppelin - Another notebook.
- Snowplow's Spark Example Project.
- Thunder - Large-scale neural data analysis with Spark.
- See Typesafe Reactive Big Data to find other Activator templates.
- See Typesafe Activator to find other Activator templates.
- See Typesafe for more information about our products and services.