
EDA and prediction of user churn using ML on big data: user log files from Sparkify (a music streaming website)


Churn-Prediciton-At-Scale

Customer Churn Analysis & Prediction at Scale (PySpark and Plotly)

Table of Contents

Introduction

Project Motivation

In music streaming services such as Spotify or YouTube Music, it is very important to monitor key business metrics and predict how customers will respond. One of the most important of these metrics is "churn", which here means the ratio of users who cancel or downgrade their premium subscriptions. In this project, I build a prediction model to understand what kinds of factors affect users' churn decisions.
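As a rough illustration of the churn ratio described above, here is a minimal pure-Python sketch. The field names `userId` and `page` and the page values in `CHURN_PAGES` are assumptions about the log schema, not something confirmed in this README, and the sample events are made up for the example.

```python
from typing import Iterable, Mapping

# Assumed page names that mark a cancel/downgrade event (hypothetical;
# adjust them to whatever the real log schema uses).
CHURN_PAGES = frozenset({"Cancellation Confirmation", "Submit Downgrade"})


def churn_rate(events: Iterable[Mapping[str, str]]) -> float:
    """Return the fraction of distinct users with at least one churn event."""
    all_users = set()
    churned_users = set()
    for event in events:
        user = event["userId"]
        all_users.add(user)
        if event["page"] in CHURN_PAGES:
            churned_users.add(user)
    # Avoid dividing by zero when the log is empty.
    if not all_users:
        return 0.0
    return len(churned_users) / len(all_users)


# Made-up sample log: users 1 and 3 churn, users 2 and 4 do not.
sample_events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "3", "page": "Submit Downgrade"},
    {"userId": "4", "page": "Home"},
]
print(churn_rate(sample_events))
```

The same per-user flag could serve as the label column for a prediction model, although the exact labelling used in the notebook may differ.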

Project Description

To process a large amount of website-log data that does not fit on my local computer, I need to understand how to work with data at scale and build the prediction model with Spark MLlib. Thus, this project has three goals:

  • Analyze and visualize the website logs
  • Build the ML model with Spark MLlib using the mini-dataset (128MB)
  • Deploy a Spark cluster on the AWS cloud to train on the full dataset (12GB)

Installation

Dataset

The mini-dataset is over 100MB, so I uploaded it in .zip format. Note that this dataset is owned by Udacity, so the full dataset (12GB) can be accessed on AWS only. Please unzip the file before use; for example, you can use the command below on Linux.

unzip mini_sparkify_event_data.zip

Environment Setup Using Conda Env

The commands below are for Linux.

conda env create -f environment.yml
source activate spark

Pre-trained Models For Convenience

If you prefer to train the models yourself, just skip this part. Otherwise, please unzip all the models in the models folder before running the notebook. Below is an example command on Linux.

unzip "models/*.zip"
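On systems without the unzip command, the same step can be sketched with Python's standard library. The helper name `extract_all` and its parameters are hypothetical. By default it extracts into the current working directory, which is what unzip does when no target directory is given.

```python
import zipfile
from pathlib import Path


def extract_all(archive_dir: str, dest: str = ".", pattern: str = "*.zip") -> list:
    """Extract every archive in archive_dir matching pattern into dest.

    Returns the names of the archives that were extracted, in sorted order.
    """
    target = Path(dest)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob(pattern)):
        # Each archive is unpacked with its internal folder layout preserved.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive.name)
    return extracted
```

Calling `extract_all("models")` from the repository root would then unpack every .zip file found in the models folder.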

Important Files

Results


Licensing, Authors, Acknowledgements

This project is part of the Udacity Data Scientist Nanodegree Program. The topic and dataset were provided by Udacity, but the code and contents were written by me. In addition, I got some help from this blog to create the pipeline.

If you find this project useful, please connect with me via LinkedIn-Suhong