
Primary language: Jupyter Notebook. License: MIT.

Wrangling_PySpark

This repo contains a Python Jupyter notebook and a simple soccer dataset to help novice data scientists get started with PySpark on their local machines (a single-node cluster).

I have come across several frustrating PySpark tutorials that promise to teach new users simple, easy-to-follow shortcuts in under five minutes. I have found them to be clickbait that lacks the depth needed to get beginners started and keep them going. So I decided to write an article on Towards Data Science - Medium in the hope of helping new PySpark users with a project-driven tutorial rather than a collection of isolated code snippets and tips. Both the dataset and the code are included in this repo. Happy learning!