
Udacity's Sparkify dataset | Predicting Churn | Apache Spark

Primary language: Jupyter Notebook | License: MIT

Predict churn using Udacity's Sparkify dataset


Introduction and Motivation

This project is an example of how to tackle a large-scale machine learning problem.

The dataset is a large log of user events from an audio streaming service similar to Spotify. The task is to predict whether a given user will cancel the service, using that user's interactions with the platform as input.

You can read the full story in this Medium post.
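As a rough illustration of the task, the sketch below derives a per-user churn label from raw events in plain Python. The event layout (`userId`, `page`), the page name `"Cancellation Confirmation"`, and the helper `label_churn` are assumptions made for this example and are not taken from this README. On the real data, this kind of aggregation would be done with Spark rather than Python dictionaries.

```python
from collections import defaultdict

# Assumed page name that marks a cancellation (not stated in this README).
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {user_id: 1 if the user ever hit the churn page, else 0}."""
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]
        # Once a user is labeled 1, later events cannot reset the label.
        labels[user] = max(labels[user], int(event["page"] == CHURN_PAGE))
    return dict(labels)


# Toy events, made up for illustration.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]
labels = label_churn(events)
print(labels)
```

Every user that appears in the events gets a label, so users who never cancel are kept as negative examples.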

Download the dataset

  • The mini dataset can be obtained here.
  • The full dataset can be obtained here.

Libraries

Python libraries

  • pyspark
  • scikit-learn
  • jupyter
  • pandas
  • numpy
  • seaborn

Running

  • Install Spark on your local machine or use a cloud service
  • Install the Python libraries and run the notebooks
  • The notebooks were run with Spark 3.1.1

Files

  • Sparkify-Mini.ipynb - A Jupyter Notebook running the analysis and model with the mini dataset
  • Sparkify-Full.ipynb - A Jupyter Notebook running the analysis and model with the full dataset
  • coefficients.xlsx - The coefficients of the Logistic Regression model trained on the full dataset

Summary of the results

The table below gives the F1 score of each classifier on the base feature dataframe and on its modified versions. On the full dataset, Logistic Regression and Linear SVM perform best, tied at 0.80.

| Dataset | Random Forest | GBT Classifier | Logistic Regression | Linear SVM |
|---|---|---|---|---|
| Mini | 0.64 | 0.70 | 0.70 | 0.73 |
| Mini with last week features | 0.64 | 0.63 | 0.70 | 0.74 |
| Mini with last week features and weights | 0.73 | 0.63 | 0.74 | 0.76 |
| Full with last week features and weights | - | - | 0.80 | 0.80 |

More stats for the Logistic Regression model on the full dataset

| Class | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| Not Canceled | 0.92 | 0.79 | 0.85 | 4376 |
| Canceled | 0.53 | 0.78 | 0.63 | 1283 |
| Total Weighted Avg. | 0.83 | 0.79 | 0.80 | 5659 |
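To make the relationship between the rows explicit, here is a small sketch that recomputes the per-class F1 scores and the support-weighted averages from the precision, recall, and support values reported above. The helper names `f1` and `weighted_average` are illustrative. Because the inputs are already rounded to two decimals, the results should only be compared at that precision.

```python
# Per-class metrics copied from the table: (precision, recall, support).
report = {
    "Not Canceled": (0.92, 0.79, 4376),
    "Canceled": (0.53, 0.78, 1283),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of per-class values, weighted by class support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


supports = [s for _, _, s in report.values()]
class_f1 = {name: f1(p, r) for name, (p, r, _) in report.items()}
avg_precision = weighted_average([p for p, _, _ in report.values()], supports)
avg_recall = weighted_average([r for _, r, _ in report.values()], supports)
avg_f1 = weighted_average(list(class_f1.values()), supports)

for name, value in class_f1.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"weighted avg: P = {avg_precision:.2f}, R = {avg_recall:.2f}, F1 = {avg_f1:.2f}")
```

The weighted average is dominated by the "Not Canceled" class, which accounts for 4376 of the 5659 users, so the overall 0.80 sits much closer to that class's 0.85 than to the 0.63 of the "Canceled" class.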

The most relevant features were the user age, the mean user age (the average over the user's events), and each page's share of the user's total page interactions. The least relevant were the hourly song counts and lengths.
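The per-page interaction percentage can be pictured with a minimal plain-Python sketch. It assumes the feature is simply the count of a user's events on one page divided by that user's total event count; the function name `page_shares` and the page names in the toy data are made up for illustration, and the notebooks may compute the feature differently.

```python
from collections import Counter


def page_shares(pages):
    """Map each page to its fraction of the user's total page events."""
    counts = Counter(pages)
    total = sum(counts.values())
    return {page: n / total for page, n in counts.items()}


# Toy page visits for a single user, made up for illustration.
visits = ["NextSong", "NextSong", "NextSong", "Thumbs Up", "Home"]
shares = page_shares(visits)
print(shares)
```

Using shares instead of raw counts keeps the feature comparable between heavy and light users, since the values for each user sum to 1.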

Full analysis

  • The full analysis, feature selection and modeling can be found at our Medium post.

Acknowledgements