
Architect Big Data Solutions with Apache Spark


Introduction

This repository contains the lectures and code for a course that provides a gentle introduction to building distributed big data pipelines with Apache Spark. Apache Spark is an open-source data processing engine for engineers and analysts that includes an optimized general execution runtime and a set of standard libraries for building data pipelines, advanced algorithms, and more. Spark is rapidly becoming the compute engine of choice for big data: Spark programs are more concise and often run 10-100 times faster than Hadoop MapReduce jobs, and as companies realize this, Spark developers are becoming increasingly valued.

In this course we will learn both the architectural and the practical side of using Apache Spark to implement big data solutions. We will use Spark Core, Spark SQL, Spark Streaming, and Spark ML to implement advanced analytics and machine learning algorithms in a production-like data pipeline. The course builds your skills in designing solutions for common big data tasks such as creating batch and real-time data processing pipelines, doing machine learning at scale, deploying machine learning models into a production environment, and much more!


Content

  1. Introduction [lecture 1] [labs] [pyspark Python cheat sheet]
  2. SQL and DataFrame [labs] [pyspark SQL cheat sheet]
  3. Batch Processing [lecture 2] [lecture 3]
  4. Stream Processing [lecture 4] [lecture 5] [labs]
  5. Machine Learning [lecture 6] [labs]

Computational Resources

  1. Please register for the Community Edition of Databricks here.
  2. Please register for a free tier AWS account here.

Data Sources

You can find data and additional information from the links below:

  1. MovieLens DataSet
  2. House Prices: Advanced Regression Techniques
  3. Titanic: Machine Learning from Disaster

Note: For your convenience, the data has already been downloaded to the Datasets folder of this repository.

Note: You can upload data to Databricks directly or use an AWS S3 bucket for storage.
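For the S3 route, a sketch of reading one of the course CSV files from a bucket with PySpark. The bucket name and path below are placeholders, and it assumes the cluster already has S3 credentials configured (on Databricks, typically via an instance profile or secrets rather than hard-coded keys):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# Placeholder bucket and key; replace with your own S3 location
df = spark.read.csv(
    "s3a://your-bucket/Datasets/movies.csv",
    header=True,        # first row contains column names
    inferSchema=True,   # let Spark guess column types
)
df.show(5)
```

Uploading through the Databricks UI (Data > Add Data) avoids S3 setup entirely and is the quickest option for the small datasets used here.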


Additional Resources

We provide links to useful cheat sheets and books to make the course as smooth as possible:

  1. A Gentle Introduction to Apache Spark
  2. How to import Data to Databricks using S3
  3. Python Cheat Sheet
  4. Machine Learning Tutorial for AWS
  5. Databricks Development Documentation
  6. Developers Guide for AWS Machine Learning
  7. Superset

Course Initiative

If you like this initiative, please star or fork this repository, and feel free to contribute via pull requests.


Places where this course has been taught in person