/spark_capstone

Repository for the Udacity Spark Capstone project

Primary LanguageHTMLMIT LicenseMIT

Sparkify Churn Prediction


Table of Contents


Project description

This project uses data from a fictional music streaming service called Sparkify. The purpose of this project is to use this data to predict user churn. Predicting whether a user will churn has high potential to prevent this from happening and increasing user retention.

The dataset is 12 gigabytes in size and it makes most sense to process this dataset using cloud computing. To reduce the cost of this project, there also is a smaller subset of data that can be processed locally.

The EDA and first model training as well as evaluation is performed on the small subset of data in the local.ipynb notebook. There is also a notebook provided called databricks.ipynb, to train one select model that performed best in the local evaluation on the full data set. This model will not be ran though.

You can find an in depth description of the project on medium: Predict churn in music streaming services

Result summary

With just very limited information about a user we were able to make a decent prediction on whether this user will churn soon or not. As it turns out, the Random Forest and Multilayer Perceptron models were best suited for this task. The MLP had a F-Score of 0.79, with no Type II errors. As for this business case, we prefer to have Type I over Type II errors, this is a satisfactory outcome.


Files in this project

  • notebooks/databricks.ipynb: Notebook to run on the full data set. Just contains the best performing model.
  • notebooks/local.ipynb: Notebook to run on the subset of data. Contains the full EDA and a comparison of all tested models.
  • .gitignore: Specifies which files and folders git should ignore
  • LICENSE: License file, specifying the MIT licence
  • local.html: HTML copy of the notebooks/local.ipynb file
  • databricks.html: HTML copy of the notebooks/databricks.ipynb file
  • README.md: Readme file
  • requirements.txt: Contains all python requirements to run this project.

Get it running

Dependencies

This project was written in Python 3.8.6 and Spark 3.0.1. You will need to install these packages with their dependencies:

  • Numpy
  • Pandas
  • Matplotlib.pyplot
  • Findspark
  • Pyspark

You can find a list of all dependencies in the requirements.txt file, as well as the version numbers used in this project. If you are using pip, you can install them with pip install -r requirements.txt from the root path of the project.


Additional Information

Author:

Robert Offner

License:

License: MIT

Acknowledgements:

  • Udacity for providing the outline of the project as well as the data set.