Churn Prediction Sparkify
Overview
Predicting user churn for a music streaming service on a local machine and on AWS EMR.
Predicting user churn (cancellation) is an important tool for a subscription service. This project tackles the problem for a music streaming service, Sparkify. By exploring Sparkify usage data, the project identifies features for model training. For computational efficiency, a tiny dataset (240 MB), sampled from the full dataset (12 GB), is used for initial data exploration, feature engineering and modelling experiments on a local machine (workflow in Sparkify_local.ipynb).
The initial work on the tiny dataset identifies the most suitable model and hyper-parameters for training the final model on the full dataset. Once the features and model are identified, they are used to model the full dataset on AWS EMR (workflow in Sparkify_AWS_EMR.ipynb).
The actionable insight from churn prediction is to identify users who are likely to churn and send them offers that may keep them from clicking the cancellation confirmation.
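To make the churn definition concrete, here is a minimal sketch in plain Python of how a per-user churn label could be derived from raw events. The field names `userId` and `page`, the page value `"Cancellation Confirmation"`, and the sample events are assumptions for illustration; the notebooks may define churn differently.

```python
from collections import defaultdict

# Assumed name of the page a user lands on when confirming a cancellation.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {userId: 1 if the user ever hit the churn page, else 0}."""
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]
        # OR-ing keeps a label at 1 once set and registers other users as 0.
        labels[user] |= int(event["page"] == CHURN_PAGE)
    return dict(labels)


# Hypothetical sample events, not taken from the real dataset.
sample_events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Thumbs Up"},
]

labels = label_churn(sample_events)
churned_users = sorted(user for user, churned in labels.items() if churned)
```

In the actual workflow the same idea would be expressed as a Spark aggregation per user; the plain-Python version only shows the labelling logic.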
My Medium post provides a more detailed explanation of this project.
Requirements:
- Python 3
- PySpark
- Pandas
- Matplotlib
- Seaborn
- Jupyter Notebook
Instructions:
Data:
Tiny: 240 MB
Big: 12 GB (s3a://udacity-dsnd/sparkify/sparkify_event_data.json)
- To run Sparkify_local.ipynb, open and run it in Jupyter Notebook.
- To run Sparkify_AWS_EMR.ipynb, spin up an AWS EMR cluster, create the Sparkify_AWS_EMR.ipynb notebook on it, and run the notebook.
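As a rough sketch of the first step on the cluster, the full dataset can be read with Spark's JSON reader. The helper name `load_events` is hypothetical, and the code assumes a `SparkSession` is already available (for example as `spark` in the notebook, or created with `SparkSession.builder.getOrCreate()`).

```python
# Path to the full dataset, as listed under Data.
FULL_DATA_PATH = "s3a://udacity-dsnd/sparkify/sparkify_event_data.json"


def load_events(spark, path=FULL_DATA_PATH):
    """Read the JSON event log into a Spark DataFrame.

    `spark` is an existing SparkSession; this helper is illustrative and
    not part of the project's notebooks.
    """
    return spark.read.json(path)
```

A typical call would be `df = load_events(spark)` followed by `df.printSchema()` to inspect the available columns. For the local workflow, pass the path of the tiny dataset file instead.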
Results
Exploratory data analysis
Training results
All features are adopted for modelling the large dataset.
Test data evaluation results
On the tiny dataset on a local machine:
Evaluation result:
+---------+------+------+--------+
|precision|recall|    f1|accuracy|
+---------+------+------+--------+
|   0.8611|0.8662|0.8622|  0.8662|
+---------+------+------+--------+
On the large dataset on AWS:
Evaluation result:
+------+------+---------+--------+
|    f1|recall|precision|accuracy|
+------+------+---------+--------+
|0.7254|0.7868|   0.7444|  0.7908|
+------+------+---------+--------+
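For readers who want to see how four such metrics relate, the sketch below computes support-weighted precision, recall and F1 together with plain accuracy from a list of labels and predictions. Weighted averaging is an assumption about how the figures were produced, not something stated here, and the labels are made up. Under this averaging, weighted recall always equals accuracy, which matches the local-machine table; the AWS figures differ slightly between the two, so they may have been computed another way.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, F1, plus plain accuracy."""
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = true_pos[label] / predicted[label] if predicted[label] else 0.0
        r = true_pos[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # each class weighted by its share of examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(true_pos.values()) / n
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels and predictions (1 = churned), for illustration only.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]
metrics = weighted_metrics(y_true, y_pred)
```

Because churned users are usually a minority, a weighted average can look healthy even when recall on the churn class alone is low, so the per-class values are worth checking as well.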