This repository is intended to serve as a reference for those interested in learning how to solve "big data problems" using PySpark. First, an example problem statement is presented, followed by a typical exploratory data analysis (EDA) workflow using tools such as Pandas, Matplotlib, and Scikit-learn. Finally, the same workflow is converted from Pandas to a PySpark notebook and script, which can be run on an automated EMR cluster.
Some of the tools used:
- Docker (docker-compose)
- Jupyter's pyspark-notebook (https://hub.docker.com/r/jupyter/pyspark-notebook)
- Amazon EMR, formerly Amazon Elastic MapReduce (https://aws.amazon.com/emr/)
...