Data Science Nanodegree Capstone Project - Sparkify
- Sparkify.ipynb - Jupyter Notebook with technical data manipulation and analysis.
- Sparkify_Blog_Post.ipynb - Jupyter Notebook for the blog post.
- HTML view of blog post is here
- pyspark for data manipulation and machine learning
- matplotlib and seaborn for data visualisation
I selected this project as an opportunity to build skills in PySpark, a technology for scalable data science that is widely used in industry.
This project uses machine learning to predict customer churn for a hypothetical music streaming service called Sparkify.
The project completed a full end-to-end data preparation, modelling and optimisation exercise using PySpark. Gradient Boosted Trees emerged as the best-performing model for predicting customer churn in this case.