/pyspark-aws-example

Example problem using PySpark, AWS EMR, and Jupyter Notebook

PySpark Example Problem

Overview

This repository is intended to serve as a reference tool for those interested in learning how to solve "big data problems" using PySpark. First, an example problem statement is presented, followed by a typical exploratory data analysis (EDA) workflow using tools such as Pandas, Matplotlib, and Scikit-learn. Finally, the same workflow principles applied with Pandas are carried over to a PySpark notebook and script, which can be run on an automated EMR cluster.
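To give a rough sense of what carrying a Pandas step over to PySpark can look like, here is a minimal sketch of one aggregation written both ways. It is not taken from the repository's notebooks: the column names `category` and `value`, the toy data, and both helper functions are hypothetical and chosen only for illustration.

```python
import pandas as pd


def mean_by_group_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """Average `value` per `category` with Pandas (in memory, single machine)."""
    return df.groupby("category", as_index=False)["value"].mean()


def mean_by_group_spark(sdf):
    """Same aggregation on a PySpark DataFrame (distributed across the cluster)."""
    # Imported lazily so the Pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    return sdf.groupBy("category").agg(F.avg("value").alias("value"))


# Hypothetical toy data; the column names are illustrative only.
pdf = pd.DataFrame({"category": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})
pandas_result = mean_by_group_pandas(pdf)
```

Given an active `SparkSession` named `spark`, the Spark version could be called as `mean_by_group_spark(spark.createDataFrame(pdf))`. The logic is the same in both functions; the main difference is that the PySpark call is evaluated lazily and distributed, so nothing is computed until an action such as `show()` or `collect()` is invoked.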

Some of the tools used:

Problem Walkthrough

...