This workshop was initially created for DevFest 2017 in Prague. After working through all the tasks and the intro presentation, you should understand the basic architecture of Apache Spark. You will know the differences between the MapReduce and Spark approaches, and between batch and stream data processing. You will be able to start a Spark job on a standalone cluster and work with the basic Spark API.
As the first step, you have to set up the Spark environment to get everything working. A few instructions are prepared for that, covering Docker installation and how to run a Docker container with Apache Spark.
Get to know Spark and the Spark REPL, and run your first job.
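For instance, a first job typed into the Spark REPL (`spark-shell`) might look like this minimal sketch; nothing here is specific to the workshop data:

```scala
// Inside spark-shell, the SparkSession (`spark`) and SparkContext (`sc`)
// are already created for you.
val numbers = sc.parallelize(1 to 1000)   // distribute a local collection
val evens   = numbers.filter(_ % 2 == 0)  // lazy transformation
println(evens.count())                    // action: triggers the actual job
```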
You will write your first Spark application. Word count is the "hello world" of distributed computation.
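A minimal word-count sketch in Scala could look roughly like the following; the app name and input path are placeholders, not part of the workshop code:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")              // placeholder app name
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("data/input.txt")         // placeholder input path
      .flatMap(_.split("\\s+"))           // split lines into words
      .map(word => (word, 1))             // pair each word with 1
      .reduceByKey(_ + _)                 // sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```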
You will analyze real data with the help of the RDD and Dataset APIs.
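The workshop's real dataset is not reproduced here, but as a sketch of the two APIs side by side (the `Visit` case class and the file path are purely hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

case class Visit(page: String, durationSec: Long)    // hypothetical schema

object Analysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("analysis").getOrCreate()
    import spark.implicits._

    // RDD API: low-level transformations on plain Scala objects
    val visits = spark.sparkContext
      .textFile("data/visits.csv")                    // placeholder path
      .map(_.split(","))
      .map(cols => Visit(cols(0), cols(1).toLong))
    val totalsRdd = visits
      .map(v => (v.page, v.durationSec))
      .reduceByKey(_ + _)

    // Dataset API: typed and optimized by Catalyst
    val totalsDs = visits.toDS()
      .groupByKey(_.page)
      .agg(sum("durationSec").as[Long])

    totalsRdd.take(5).foreach(println)
    totalsDs.show(5)
    spark.stop()
  }
}
```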
You can submit and run the whole application in cluster deploy mode on a standalone cluster.
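A typical `spark-submit` invocation for that could look roughly like the following; the master URL, main class, and jar path are placeholders for whatever your build produces:

```bash
# Submit the assembled jar to the standalone master in cluster deploy mode.
spark-submit \
  --master spark://localhost:7077 \
  --deploy-mode cluster \
  --class example.WordCount \
  target/scala-2.12/workshop.jar
```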
Recommended further reading: Spark: The Definitive Guide