
Building Spark Applications Live Lessons

Join the chat at https://gitter.im/zipfian/building-spark-applications-live-lessons

This repository contains the exercises and data for the Building Spark Applications Live Lessons video series. It provides data scientists and developers with a practical introduction to the Apache Spark framework using Python, R, and SQL. Additionally, it covers best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.

Materials

The corresponding videos can be found on the following sites for purchase:

In addition to the videos, there are many other resources available to support you in learning this new technology:

Please do not hesitate to reach out to me directly via email at jondinu@gmail.com or on Twitter @clearspandex.

If you find any errors in the code or materials, please open a GitHub issue in this repository.

Skill Level

Beginning/Intermediate

What You Will Learn

  • How to install and set up a Spark environment locally and on a cluster
  • The differences between the Python, R, and SQL programming interfaces and the strengths of each
  • How to build a machine learning model for text
  • Common data science use cases that Spark is especially well-suited to solve
  • How to tune a Spark application for performance
  • The internals of the Spark framework and its execution model
  • How to use Spark in a data science application workflow
  • The basics of the larger Spark ecosystem

Who Should Take This Course

  • Practicing data scientists who already use Python or R and want to learn how to scale up their analyses with Spark.
  • Data engineers who already use Java/Scala for Spark but want to learn the Python, R, and SQL APIs and understand how Spark can be used to solve data science problems.

Prerequisites

  • Basic understanding of programming (Python a plus).
  • Familiarity with the data science process and machine learning is a plus.

Setup

SparkR with a Notebook

  1. Install IRkernel:
install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = c('http://irkernel.github.io/', getOption('repos')))

# Register the R kernel with Jupyter
IRkernel::installspec()
  2. Set environment variables:
# Example: Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/[username]/spark")

# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# if these two lines work, you are all set
library(SparkR)
sc <- sparkR.init(master="local")

Data

IPython Console Help

Q: How can I find out all the methods that are available on DataFrame?

  • In the IPython console type sales.[TAB]

  • Autocomplete will show you all the methods that are available.

  • To find more information about a specific method, say .cov, type help(sales.cov)

  • This will display the API documentation for that method.
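For example, here is a minimal sketch of the same exploration done programmatically in a pyspark (Spark 1.x-era) shell, where sqlContext is predefined; the file name sales.json and the column names below are hypothetical stand-ins for whichever course dataset you load:

# Load a DataFrame (sales.json is a placeholder for your dataset)
sales = sqlContext.read.json("sales.json")

# The programmatic equivalent of typing sales.[TAB]:
# list the public methods and attributes available on the DataFrame
print([name for name in dir(sales) if not name.startswith("_")])

# Display the API documentation for a specific method, e.g. .cov,
# which returns the sample covariance of two numeric columns as a float
help(sales.cov)

# sales.cov("price", "quantity")  # example call once you know your column names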

Spark Documentation

Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, deploying Spark to EC2?

  • Go to https://spark.apache.org/docs/latest

  • Navigate using tabs to the following areas in particular.

  • Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR.

  • Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2.

  • More > Configuration, Monitoring, Tuning Guide.

Books

Staying Involved