- Installation
- Project Motivation
- Folder Structure
- Feature Engineering
- Modelling
- Results
- Licensing, Authors, and Acknowledgements
Apart from the Anaconda distribution of Python, this code requires PySpark, running in either a standalone or a clustered environment.
Predicting churn is a common and challenging problem that data scientists and analysts regularly encounter in any customer-facing business. In this notebook, the Sparkify mini dataset is used to analyse the contents of the data and then build a model with the Spark ML libraries to predict user churn.
- Sparkify.ipynb
  - Contains all code for data cleaning, data exploration, modelling, and conclusions.
The following features were used for the model:
- Average session length
- Number of Platforms used by the user
- Number of artists
- Number of Thumbs Up
- Number of Thumbs Down
- Number of Sessions
- Number of days since registration
- Gender
- Platform
- Level of subscription
- Churn (label)
- Downgraded
The following models were tried on the features created from the dataset after cleaning and exploration:
- Logistic Regression
- Gradient-Boosted Trees
- Random Forest Classifier
Of the models tried, GBT performed best, followed by RFC and LR, with F1 scores of 86%, 83%, and 79% respectively.
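For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The plain-Python sketch below computes it for the positive (churn) class on made-up labels. The source does not say whether the reported figures are binary or class-weighted F1, so this is only an illustration of the definition, and `f1_score` is a hypothetical helper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: 2 * P * R / (P + R)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 3 churned users, 2 of them caught, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(y_true, y_pred)  # precision = recall = 2/3, so F1 = 2/3
```

F1 is a common choice for churn data because churned users are usually a small minority, which makes plain accuracy misleading.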
The main findings are summarized in the blog post here.