
Data Science Nanodegree Capstone Project - Sparkify


Key Deliverables:

  • Sparkify.ipynb - Jupyter Notebook with technical data manipulation and analysis.
  • Sparkify_Blog_Post.ipynb - Jupyter Notebook for the blog post.
  • HTML view of blog post is here

Libraries Used

  • pyspark for data manipulation and machine learning
  • matplotlib and seaborn for data visualisation


This project was chosen as an opportunity to build skills in PySpark, a technology for scalable data science that is widely used in industry.


This project uses machine learning to predict customer churn for a hypothetical music streaming service called Sparkify.


The project completed a full end-to-end data preparation, modelling and optimisation exercise using PySpark. Gradient Boosted Trees emerged as the best-performing model for predicting customer churn in this case.
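
For readers unfamiliar with how a modelling and optimisation step like this is typically wired up in PySpark, the sketch below shows one possible arrangement: a feature assembler and a Gradient Boosted Trees classifier in a pipeline, tuned by cross-validated grid search. It is not the notebook's actual code. The feature column names, the label column name, the grid values and the F1 metric are all assumptions made for illustration.

```python
# Hypothetical per-user feature columns; the notebook's real features may differ.
FEATURE_COLS = ["num_songs", "num_sessions", "days_active"]
# Hypothetical binary label column (1.0 = churned, 0.0 = retained).
LABEL_COL = "churn"


def build_gbt_cv(feature_cols=FEATURE_COLS, label_col=LABEL_COL, num_folds=3):
    """Return an unfitted CrossValidator around an assembler + GBT pipeline.

    PySpark is imported inside the function so this module can be loaded
    without Spark installed. Calling the function needs an active
    SparkSession, because pyspark.ml estimators are backed by the JVM.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Combine the numeric feature columns into the single vector column
    # that Spark ML estimators expect.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    gbt = GBTClassifier(featuresCol="features", labelCol=label_col, seed=42)
    pipeline = Pipeline(stages=[assembler, gbt])

    # Small illustrative grid (assumed values, not the notebook's settings).
    grid = (
        ParamGridBuilder()
        .addGrid(gbt.maxDepth, [3, 5])
        .addGrid(gbt.maxIter, [10, 20])
        .build()
    )

    # F1 is chosen here as an assumption; it is a common choice when the
    # churned class is much smaller than the retained class.
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="f1"
    )

    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
        seed=42,
    )
```

With a Spark session running and a training DataFrame `train_df` (hypothetical) that contains the columns above, `cv_model = build_gbt_cv().fit(train_df)` would fit every grid combination, and `cv_model.bestModel` would hold the pipeline with the highest cross-validated score.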