
Primary language: Jupyter Notebook. License: MIT.


Introduction-to-Apache-Spark

This repository offers introductory materials explaining what Spark is and how it works, along with scripts that demonstrate the basics. More scripts on machine learning with Spark may be added in the future.

Explanation of the individual files:

  1. Introduction to Spark: A written explanation (8-page PDF) of what Spark is, what we use it for, and the most important concepts Spark takes advantage of.
  2. Installing PySpark: An explanation of the steps needed to install PySpark on your local machine.
  3. Spark Basics: A Jupyter Notebook explaining how to write code in PySpark and covering the most important functions.
  4. Spark - Basic Machine Learning: A Jupyter Notebook showing how to create a small machine learning pipeline with the PySpark ML package.
  5. Setting up PySpark on GCP: As Spark is made for processing big data, some of you might be interested in running it on a cloud platform. This file explains how to set up a cluster for PySpark on Google Cloud Platform and interact with it via a Jupyter Notebook.

    If you have any questions about the code or the included files, please feel free to reach out to me at: dominique.c.a.paul@gmail.com