Churn-Prediciton-At-Scale

The Customer Churn Analysis & Prediction At Scale (PySpark and Plotly)

Introduction
Installation
Project Results
Licensing, Authors, Acknowledgements

Introduction

Project Motivation

In music streaming services such as Spotify or YouTube Music, it is important to monitor key business metrics and predict how customers will respond. One of the most important of these metrics is churn: the ratio of users who cancel or downgrade their premium subscriptions. In this project, I build a prediction model to understand which factors affect a user's decision to churn.

Project Description

The website-log data is too large to fit on my local computer, so I need to learn how to work with data at scale and build the prediction model with Spark MLlib. This project therefore has three goals:

Analyze and visualize the website logs
Build the ML model with Spark MLlib using the mini-dataset (128MB)
Deploy a Spark cluster on AWS to train on the full dataset (12GB)
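
To make the churn definition concrete, here is a minimal sketch in plain Python (not the Spark code from the notebook) that labels each user as churned or not and computes the churn ratio. The field names `userId` and `page`, and the page names "Cancellation Confirmation" and "Submit Downgrade", are assumptions about the log schema; the helper names are made up for illustration.

```python
from collections import defaultdict

# Assumed page names that signal churn; check them against your own logs.
CHURN_PAGES = {"Cancellation Confirmation", "Submit Downgrade"}


def label_churn(events):
    """Return {userId: 1 or 0}, where 1 means the user hit a churn page at least once."""
    labels = defaultdict(int)
    for event in events:
        user = event.get("userId")
        if not user:  # skip rows with an empty or missing user id
            continue
        # max() keeps the label at 1 once any churn event has been seen
        labels[user] = max(labels[user], int(event.get("page") in CHURN_PAGES))
    return dict(labels)


def churn_ratio(labels):
    """Fraction of users labeled as churned (0.0 when there are no users)."""
    return sum(labels.values()) / len(labels) if labels else 0.0


# Tiny made-up log, for illustration only
sample_events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "3", "page": "Submit Downgrade"},
    {"userId": "", "page": "Home"},
    {"userId": "4", "page": "Thumbs Up"},
]

labels = label_churn(sample_events)
print(labels)
print(churn_ratio(labels))
```

On the full dataset the same per-user labeling would be done as a grouped aggregation in Spark rather than a Python loop, since the logs do not fit in local memory.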

Installation

Dataset

The mini-dataset is over 100MB, so I uploaded it as a .zip file. Note that the dataset is owned by Udacity, so the full dataset (12GB) can be accessed on AWS only. Please unzip the file before use; for example, on Linux you can use the command below.

unzip mini_sparkify_event_data.zip

Environment Setup Using Conda Env

The commands below are for Linux:

conda env create -f environment.yml
source activate spark

Pre-trained Models For Convenience

If you prefer to train the models yourself, skip this part. Otherwise, unzip all the models in the models/ directory before running the notebook. Below is an example command on Linux:

unzip "models/*.zip"
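
If `unzip` is not available (for example on Windows), a rough Python equivalent is sketched below. The function name `unzip_all` and its `dest` parameter are hypothetical, not part of this repository. Like the `unzip` command above, it extracts into the current working directory by default.

```python
import zipfile
from pathlib import Path


def unzip_all(directory, dest="."):
    """Extract every .zip archive found in `directory` into `dest`.

    Returns the names of the archives that were extracted, in sorted order.
    """
    extracted = []
    for archive in sorted(Path(directory).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)  # keeps the paths stored inside the archive
        extracted.append(archive.name)
    return extracted
```

Calling `unzip_all("models")` from the repository root should then behave like the command above; pass a different `dest` if the extracted models need to land elsewhere.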

Important Files

Sparkify.ipynb: the notebook containing all the code, which runs locally
Sparkify.html: a snapshot of Sparkify.ipynb with the results
Sparkify_udacity.ipynb: the default notebook provided by Udacity
models/ : the directory for the saved models from the notebook

Results

For an overview and general explanation of this project, please see my Medium post: "How to Predict Customer Churn From Your Website Logs?"

If you want to play with my code online, you can visit my Kaggle post: "A Tutorial of Customer Churn Analysis & Prediction"

Licensing, Authors, Acknowledgements

This project is part of the Udacity Data Scientist Nanodegree Program. The topic and dataset were provided by Udacity, but the code and content were written by me. In addition, I got some help from this blog when creating the pipeline.

If you find this project useful, please connect with me via LinkedIn-Suhong

suhongkim/Churn-Prediciton-At-Scale