Predict Churn Using Udacity's Sparkify Dataset
Udacity's Sparkify dataset | Predicting Churn | Apache Spark
Introduction and Motivation
This project is an example of how to tackle a large-scale machine learning problem.
We have a large dataset of user events from an audio streaming service similar to Spotify. The task is to predict whether a given user will cancel the service, based on their interactions with the platform.
You can read the full story at this Medium post
Download the dataset
Libraries
Python libraries
pyspark
scikit-learn
jupyter
pandas
numpy
seaborn
Running
- Install Spark on your local machine or use a cloud service
- Install the Python libraries and run the notebooks
- The notebooks were run with Spark 3.1.1
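Before opening the notebooks, it can help to confirm that the libraries listed above are installed. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of this repository.

```python
from importlib import metadata

# Distribution names as listed in the Libraries section.
REQUIRED = ["pyspark", "scikit-learn", "jupyter", "pandas", "numpy", "seaborn"]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `pyspark` entry of 3.1.1 would match the Spark version the notebooks were run with.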
Files
- Sparkify-Mini.ipynb: A Jupyter Notebook running the analysis and model with the mini dataset
- Sparkify-Full.ipynb: A Jupyter Notebook running the analysis and model with the full dataset
- coefficients.xlsx: The coefficients of the Logistic Regression model trained on the full dataset
Summary of the results
The table below shows the F1 score of each classifier on the base feature dataframe and on its modified versions.
On the full dataset, Logistic Regression and Linear SVM tie for the best F1 score, both reaching 0.80.
| | Random Forest | GBT Classifier | Logistic Regression | Linear SVM |
|---|---|---|---|---|
| Mini | 0.64 | 0.70 | 0.70 | 0.73 |
| Mini with last week features | 0.64 | 0.63 | 0.70 | 0.74 |
| Mini with last week features and weights | 0.73 | 0.63 | 0.74 | 0.76 |
| Full with last week features and weights | | | 0.80 | 0.80 |
More stats for the Logistic Regression model on the full dataset
| | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| Not Canceled | 0.92 | 0.79 | 0.85 | 4376 |
| Canceled | 0.53 | 0.78 | 0.63 | 1283 |
| Total Weighted Avg. | 0.83 | 0.79 | 0.80 | 5659 |
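As a sanity check, the F1 scores and weighted averages in the table can be re-derived from the per-class precision, recall, and support. The sketch below uses only the rounded values from the table, so it agrees with the reported figures to two decimals; the function names are illustrative and do not come from the notebooks.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of per-class values, weighted by class support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Rounded per-class values copied from the table: (precision, recall, support)
classes = {
    "Not Canceled": (0.92, 0.79, 4376),
    "Canceled": (0.53, 0.78, 1283),
}

supports = [s for _, _, s in classes.values()]
f1_per_class = {name: f1_score(p, r) for name, (p, r, _) in classes.items()}

weighted_precision = weighted_average([p for p, _, _ in classes.values()], supports)
weighted_recall = weighted_average([r for _, r, _ in classes.values()], supports)
weighted_f1 = weighted_average(list(f1_per_class.values()), supports)

for name, f1 in f1_per_class.items():
    print(f"{name}: F1 = {f1:.2f}")
print(
    f"Weighted avg: precision = {weighted_precision:.2f}, "
    f"recall = {weighted_recall:.2f}, F1 = {weighted_f1:.2f}"
)
```

The precision of 0.53 for the Canceled class means roughly half of the users flagged as churners did not actually cancel, while the recall of 0.78 means most real cancellations were caught.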
The most relevant features were the user age, the mean user age (averaged over the user's events), and each page's share of the user's total page interactions. The least relevant were the hourly song counts and lengths.
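To illustrate the page-interaction feature, the sketch below computes, for a single user, the fraction of events that fall on each page. It is plain Python rather than the Spark code used in the notebooks, and the event records, page names, and function name are made-up examples.

```python
from collections import Counter


def page_interaction_share(events):
    """Return {page: fraction of the user's events that occurred on that page}."""
    counts = Counter(event["page"] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {page: n / total for page, n in counts.items()}


# Hypothetical event log for one user (illustrative data only).
user_events = [
    {"page": "NextSong"},
    {"page": "NextSong"},
    {"page": "NextSong"},
    {"page": "Thumbs Up"},
    {"page": "Home"},
]

shares = page_interaction_share(user_events)
print(shares)
```

Because the shares for one user sum to 1, they are comparable across users with very different activity levels, which raw page counts are not.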
Full analysis
- The full analysis, feature selection, and modeling are described in our Medium post.