Predict Churn Using Udacity's Sparkify Dataset

Udacity's Sparkify dataset | Predicting Churn | Apache Spark

Introduction and Motivation

This project is an example of how to tackle a large-scale machine learning problem.

The dataset is a large collection of user events from an audio streaming service similar to Spotify. The task is to predict whether a given user will cancel the service, based on their interactions with the platform.

You can read the full story in this Medium post.

Download the dataset

The mini dataset can be obtained here.
The full dataset can be obtained here.

Libraries

Python libraries

pyspark
scikit-learn
jupyter
pandas
numpy
seaborn

Running

Install Spark on your local machine or use a cloud service
Install the Python libraries and run the notebooks
The notebooks were run with Spark 3.1.1

Files

Sparkify-Mini.ipynb - A Jupyter Notebook running the analysis and model with the mini dataset
Sparkify-Full.ipynb - A Jupyter Notebook running the analysis and model with the full dataset
coefficients.xlsx - The coefficients of the Logistic Regression model trained on the full dataset

Summary of the results

The table below shows the F1 score of each classifier on the base feature dataframe and on its modified versions. On the full dataset, Logistic Regression and Linear SVM tie as the best models, each with an F1 score of 0.80.

Dataset	Random Forest	GBT Classifier	Logistic Regression	Linear SVM
Mini	0.64	0.70	0.70	0.73
Mini with last week features	0.64	0.63	0.70	0.74
Mini with last week features and weights	0.73	0.63	0.74	0.76
Full with last week features and weights			0.80	0.80

More stats for the Logistic Regression model on the full dataset

Class	Precision	Recall	F1 Score	Support
Not Canceled	0.92	0.79	0.85	4376
Canceled	0.53	0.78	0.63	1283
Total Weighted Avg.	0.83	0.79	0.80	5659
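
As a quick consistency check on the table above, the per-class F1 scores can be recomputed from the reported precision and recall, and the weighted averages from the support counts. This is a minimal sketch in plain Python that uses only the rounded values from the table. The function and variable names are illustrative and are not taken from the notebooks.

```python
# Rounded values copied from the table above: (precision, recall, support).
report = {
    "Not Canceled": (0.92, 0.79, 4376),
    "Canceled": (0.53, 0.78, 1283),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, weights):
    """Average of values, weighted by the per-class support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [s for _, _, s in report.values()]
per_class_f1 = {name: f1_score(p, r) for name, (p, r, _) in report.items()}

weighted = {
    "precision": weighted_average([p for p, _, _ in report.values()], supports),
    "recall": weighted_average([r for _, r, _ in report.values()], supports),
    "f1": weighted_average(list(per_class_f1.values()), supports),
}

for name, score in per_class_f1.items():
    print(f"{name}: F1 = {score:.2f}")
for metric, value in weighted.items():
    print(f"Weighted {metric} = {value:.2f}")
```

Rounded to two decimals, the recomputed values agree with the table. The gap between the two classes shows that the weighted average is dominated by the larger "Not Canceled" class.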

The most relevant features were the user age, the mean user age (the average over the user's events), and each page's share of the user's interactions across all pages. The least relevant were the hourly song counts and lengths.
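
To illustrate the per-page interaction feature, here is a minimal sketch in plain Python. It reads the feature as, for each user, the fraction of that user's events that fall on each page. That reading, the `(userId, page)` event layout, and the sample events are all assumptions made for illustration. The notebooks compute their features with Spark and may define them differently.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (userId, page) pairs, made up for illustration.
events = [
    ("u1", "NextSong"),
    ("u1", "NextSong"),
    ("u1", "Thumbs Up"),
    ("u1", "Home"),
    ("u2", "NextSong"),
    ("u2", "Downgrade"),
]


def page_shares(events):
    """Return {user: {page: fraction of that user's events on the page}}."""
    counts = defaultdict(Counter)
    for user, page in events:
        counts[user][page] += 1
    shares = {}
    for user, page_counts in counts.items():
        total = sum(page_counts.values())
        shares[user] = {page: n / total for page, n in page_counts.items()}
    return shares


shares = page_shares(events)
for user, user_shares in shares.items():
    print(user, user_shares)
```

Each user's shares sum to 1, so the feature describes how a user spreads their activity across pages, independent of how active they are overall.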

Full analysis

The full analysis, feature selection, and modeling can be found in our Medium post.

reneoctavio/sparkify