Data Science Nanodegree Capstone Project - Sparkify

dsnd_sparkify

Key Deliverables:

  • Sparkify.ipynb - Jupyter Notebook with technical data manipulation and analysis.
  • Sparkify_Blog_Post.ipynb - Jupyter Notebook for the blog post.
  • HTML view of blog post is here

Libraries Used

  • pyspark for data manipulation and machine learning
  • matplotlib and seaborn for data visualisation

Motivation

I selected this project as an opportunity to build skills in PySpark, a technology for scalable data science that is widely used in industry.

Purpose:

This project uses machine learning to predict customer churn for a hypothetical music streaming service called Sparkify.

Summary:

Completed an end-to-end data preparation, modelling and optimisation exercise using PySpark. Of the models evaluated, Gradient Boosted Trees performed best at predicting customer churn.