Predicting music streaming service user churn on local machine and AWS EMR using Apache Spark.

Primary language: HTML
License: MIT

Churn Prediction Sparkify

Overview

User churn (cancellation) prediction is an important predictive tool. This project sets out to solve this problem for a music streaming service, Sparkify. By exploring Sparkify usage data, the project identifies features for model training. For computational efficiency, a tiny dataset (240 MB), sampled from the full dataset (12 GB), is used for initial data exploration, feature engineering and modelling experiments on a local machine (workflow in Sparkify_local.ipynb).

The initial work on the tiny dataset identifies the most suitable model and hyperparameters. Once the features and model are chosen, they are used to train the final model on the full dataset on AWS EMR (workflow in Sparkify_AWS_EMR.ipynb).

The actionable insight from churn prediction is to identify users who are likely to churn and send them offers that, ideally, keep them from clicking the cancellation confirmation.
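As a minimal sketch of the churn definition described above, the snippet below labels a user as churned if any of their events is the cancellation confirmation. The field names `userId` and `page`, the page value `"Cancellation Confirmation"`, and the sample events are assumptions made for illustration; the notebooks may define the label differently and do so with Spark rather than plain Python.

```python
from collections import defaultdict

# Assumed page value; the README only says that churn means reaching
# the cancellation confirmation.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Map each user id to 1 if any of their events is the churn page, else 0."""
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]  # assumed field name
        is_churn_event = int(event["page"] == CHURN_PAGE)  # assumed field name
        labels[user] = max(labels[user], is_churn_event)
    return dict(labels)


# Tiny made-up event log, for illustration only.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]

print(label_churn(events))
```

The same idea carries over to Spark as a per-user aggregation over the event log: flag the churn events, then take the maximum flag per user.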

My Medium post provides a more detailed explanation of this project.

Requirements:

  1. Python 3
  2. PySpark
  3. Pandas
  4. Matplotlib
  5. Seaborn
  6. Jupyter Notebook

Instructions:

Data:
Tiny dataset: 240 MB
Full dataset: 12 GB (s3a://udacity-dsnd/sparkify/sparkify_event_data.json)

  1. To run Sparkify_local.ipynb, simply run it in Jupyter notebook.
  2. To run Sparkify_AWS_EMR.ipynb, spin up an AWS EMR cluster, create the Sparkify_AWS_EMR.ipynb notebook on it, and run it there.
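The two run modes differ mainly in which dataset is read. A rough sketch of that switch is shown below. The full dataset path is the one listed under Data; `TINY_DATA_PATH` is a hypothetical placeholder because the tiny file's name is not given here, and the helper names `choose_data_path` and `load_events` are illustrative, not taken from the notebooks.

```python
# Full dataset location, as listed in this README.
FULL_DATA_PATH = "s3a://udacity-dsnd/sparkify/sparkify_event_data.json"
# Hypothetical placeholder: the README does not name the tiny dataset file.
TINY_DATA_PATH = "path/to/tiny_dataset.json"


def choose_data_path(on_emr):
    """Return the full dataset path on EMR, the tiny sample path otherwise."""
    return FULL_DATA_PATH if on_emr else TINY_DATA_PATH


def load_events(path, app_name="Sparkify"):
    """Read the JSON event log into a Spark DataFrame.

    PySpark is imported here so the path helper above works without it.
    """
    from pyspark.sql import SparkSession

    # getOrCreate() reuses an existing session if the notebook already has one.
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark.read.json(path)
```

On a local machine this would be called as `load_events(choose_data_path(on_emr=False))` after pointing `TINY_DATA_PATH` at the real file.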

Results

Exploratory data analysis

User Churn Status

Daily evolution of User Churn Status

Training results

Gradient Boosting Tree Classifier feature importance

All features are adopted for modelling the large dataset.
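For readers who want to reproduce a feature importance ranking, the sketch below shows one way to fit a gradient-boosted tree classifier with PySpark and sort its importances. The column names `features` and `label`, the seed, and both helper names are assumptions for illustration; the notebooks' actual feature set and hyperparameters are not reproduced here.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance score, highest first."""
    pairs = zip(feature_names, (float(score) for score in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fit_gbt(train_df, features_col="features", label_col="label"):
    """Fit a gradient-boosted tree classifier on an assembled feature column.

    Column names are assumptions; PySpark is imported lazily so the
    ranking helper above can be used without it.
    """
    from pyspark.ml.classification import GBTClassifier

    gbt = GBTClassifier(featuresCol=features_col, labelCol=label_col, seed=42)
    return gbt.fit(train_df)
```

With a fitted model, `rank_importances(names, model.featureImportances.toArray())` gives the ranking, where `names` is the list of input columns in the order they were assembled into the feature vector.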

Test data evaluation results

On the tiny dataset on a local machine:
Evaluation result:
+---------+------+------+--------+
|precision|recall|    f1|accuracy|
+---------+------+------+--------+
|   0.8611|0.8662|0.8622|  0.8662|
+---------+------+------+--------+

On the large dataset on AWS EMR:
Evaluation result:
+------+------+---------+--------+
|    f1|recall|precision|accuracy|
+------+------+---------+--------+
|0.7254|0.7868|   0.7444|  0.7908|
+------+------+---------+--------+
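As a reminder of how the four reported metrics relate to each other, here is a small sketch that computes them from confusion-matrix counts with churn as the positive class. The counts are made up for illustration and are not the project's results; this README does not state whether its figures are per-class or averaged across classes, so the sketch may not match how the notebooks compute them.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy with churn as the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only: 30 churners caught, 20 missed,
# 10 false alarms, 140 non-churners correctly identified.
metrics = binary_metrics(tp=30, fp=10, fn=20, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because churners are usually a minority of users, accuracy alone can look good even when recall on churners is poor, which is why the F1 score is reported alongside it.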