Workshop: An Introduction to Apache Spark - 101

This workshop was initially created for DevFest 2017 in Prague. After you go through all the tasks and the intro presentation, you should know the basics of the Apache Spark architecture. You will know the differences between the MapReduce and Spark approaches and between batch and stream data processing. You will be able to start a Spark job in a standalone cluster and work with the basic Spark API.


Set Up the Environment

As the first step, you have to set up your Spark environment so that everything works. A few instructions are prepared to guide you through it, including Docker installation and a description of how to run a Docker container with Apache Spark.
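As a rough sketch of what this step looks like, the commands below pull and start a Spark container; the image name is a placeholder, since the workshop's own instructions specify the actual image to use.

```sh
# Illustrative only: <spark-image> is a placeholder for the image named
# in the workshop's installation instructions.
docker pull <spark-image>

# Start an interactive container and open the Spark REPL (assuming
# spark-shell is on the image's PATH).
docker run -it <spark-image> spark-shell
```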


Task 0: The First Run of Spark

Get to know Spark and the Spark REPL, and run your first job.
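To give a flavor of this task, here is a minimal job you can type into the Spark REPL (spark-shell), where the SparkContext is already available as `sc`:

```scala
// Distribute a local collection across the cluster as an RDD.
val numbers = sc.parallelize(1 to 1000)

// Calling an action (sum) triggers the actual distributed computation.
val total = numbers.sum()
println(s"Sum: $total")
```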


Task 1: Word-count

You will write your first Spark application. Word count is the "hello world" of distributed computation.
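For reference, a classic word count with the RDD API looks roughly like the sketch below; the input path "input.txt" is a placeholder, not the workshop's actual file.

```scala
val lines = sc.textFile("input.txt")       // read the file as an RDD of lines

val counts = lines
  .flatMap(_.split("\\s+"))                // split each line into words
  .map(word => (word, 1))                  // pair each word with a count of 1
  .reduceByKey(_ + _)                      // sum the counts per word

counts.take(10).foreach(println)           // print a sample of the results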


Task 2: Analyzing Flight Delays

You will analyze real data with the help of the RDD and Dataset APIs.
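As a sketch of the Dataset side of this task, the snippet below reads a CSV of flights and computes the average departure delay per origin airport. The file name flights.csv, the Flight case class, and the column names (origin, dest, depDelay) are assumptions for illustration, not the workshop's actual schema.

```scala
// Hypothetical schema for illustration; the real dataset may differ.
case class Flight(origin: String, dest: String, depDelay: Double)

// In spark-shell a SparkSession is available as `spark`.
import spark.implicits._

val flights = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("flights.csv")
  .as[Flight]

// Average departure delay per origin airport.
flights.groupBy($"origin")
  .avg("depDelay")
  .show()
```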


Optional: Run All Spark Jobs in the Cluster

You can submit and run all the applications in cluster deploy mode on a standalone cluster, as sketched below.
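A typical submission looks like the following; the master URL, main class, and jar name here are placeholders for whatever your build and cluster setup produce.

```sh
# Illustrative only: substitute your own master host, class, and jar.
spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --class com.example.WordCount \
  word-count.jar
```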

Recommended further reading: Spark: The Definitive Guide