This repository contains the exercises and data for the Building Spark Applications Live Lessons video series. It provides data scientists and developers with a practical introduction to the Apache Spark framework using Python, R, and SQL. Additionally, it covers best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.
The corresponding videos can be found on the following sites for purchase:
In addition to the videos, there are many other resources to support you as you learn this new technology:
Please do not hesitate to reach out to me directly via email at jondinu@gmail.com or on Twitter @clearspandex.
If you find any errors in the code or materials, please open a GitHub issue in this repository.
Beginning/Intermediate
- How to install and set up a Spark environment locally and on a cluster
- The differences between, and the strengths of, the Python, R, and SQL programming interfaces
- How to build a machine learning model for text
- Common data science use cases that Spark is especially well-suited to solve
- How to tune a Spark application for performance
- The internals of the Spark framework and its execution model
- How to use Spark in a data science application workflow
- The basics of the larger Spark ecosystem
- Practicing data scientists who already use Python or R and want to learn how to scale up their analyses with Spark.
- Data engineers who already use Java/Scala for Spark but want to learn about the Python, R, and SQL APIs and understand how Spark can be used to solve data science problems.
- Basic understanding of programming (Python a plus).
- Familiarity with the data science process and machine learning is a plus.
- Install IRkernel
# Install the R kernel for IPython/Jupyter notebooks and its dependencies
install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = c('http://irkernel.github.io/', getOption('repos')))
# Register the kernel with the notebook server
IRkernel::installspec()
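If the install succeeds, an R kernel option should appear the next time you launch the IPython/Jupyter notebook server.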
# Example: Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/[username]/spark")
# This line adds the SparkR library to R's search path so it can be loaded
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# if these two lines work, you are all set
library(SparkR)
sc <- sparkR.init(master="local")
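# Optional smoke test, a sketch assuming the Spark 1.4-1.6 era SparkR API
# used above: build a DataFrame from R's built-in `faithful` data set
# and look at the first few rows
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)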
Q: How can I find out all the methods that are available on a DataFrame?

- In the IPython console, type `sales.[TAB]`. Autocomplete will show you all the methods that are available.
- To find more information about a specific method, say `.cov`, type `help(sales.cov)`. This will display the API documentation for that method.
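Outside of IPython, you can get the same information programmatically. Here is a minimal sketch; the `sales` DataFrame is a hypothetical stand-in built with the newer SparkSession entry point (with the Spark 1.x API shown in the SparkR setup above, the equivalent would be sqlContext.createDataFrame):

from pyspark.sql import SparkSession

# Hypothetical stand-in for the `sales` DataFrame referenced above
spark = SparkSession.builder.master("local").appName("faq-demo").getOrCreate()
sales = spark.createDataFrame(
    [("widget", 2, 3.50), ("gadget", 1, 7.25)],
    ["product", "quantity", "price"])

# dir() lists the same methods that [TAB] autocompletes in IPython
print([m for m in dir(sales) if not m.startswith("_")])

# help() displays the API documentation for a specific method, e.g. .cov
help(sales.cov)

spark.stop()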
Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, or deploying Spark to EC2?

- Navigate using the tabs on the official Spark documentation site (https://spark.apache.org/docs/latest/) to the following areas in particular:
  - Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR
  - Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2
  - More > Configuration, Monitoring, Tuning Guide