Sparkify: User churn prediction

Table of Contents

  1. Installation
  2. Project Motivation
  3. Folder Structure
  4. Feature Engineering
  5. Modelling
  6. Results
  7. Licensing, Authors, and Acknowledgements

Installation

In addition to the Anaconda distribution of Python, this code requires PySpark, running either in standalone (local) mode or on a cluster.

Project Motivation

Predicting churn is a common and challenging problem that data scientists and analysts face in any customer-facing business. This notebook explores the Sparkify mini dataset and then builds a model with the Spark ML libraries to predict user churn.

Folder Structure

  • Sparkify.ipynb

    • Contains all code for data cleaning, data exploration, modelling, and conclusions.

Feature Engineering

The following features were used for the model:

  • Average Session length
  • Number of Platforms used by the user
  • Number of artists
  • Number of Thumbs Up
  • Number of Thumbs Down
  • Number of Sessions
  • Number of days since registration
  • Gender
  • Platform
  • Level of subscription
  • Churn (label)
  • Downgraded
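
As an illustration of how per-user features like these can be derived from event logs, here is a minimal sketch in plain Python (no Spark, so it runs anywhere). It covers only a subset of the features listed above. The field names (userId, sessionId, page, artist, ts), the page values, and the rule that a "Cancellation Confirmation" event marks churn are assumptions for illustration and may differ from what the notebook actually does.

```python
from collections import defaultdict

# Hypothetical event records. Field names and page values are assumptions
# chosen for illustration, not taken from the notebook.
events = [
    {"userId": "1", "sessionId": 10, "page": "NextSong", "artist": "A", "ts": 0},
    {"userId": "1", "sessionId": 10, "page": "Thumbs Up", "artist": None, "ts": 60_000},
    {"userId": "1", "sessionId": 11, "page": "NextSong", "artist": "B", "ts": 500_000},
    {"userId": "1", "sessionId": 11, "page": "Cancellation Confirmation", "artist": None, "ts": 620_000},
    {"userId": "2", "sessionId": 20, "page": "NextSong", "artist": "A", "ts": 0},
    {"userId": "2", "sessionId": 20, "page": "Thumbs Down", "artist": None, "ts": 30_000},
]


def build_user_features(events):
    """Aggregate raw events into one feature dict per user."""
    # Group events by user.
    per_user = defaultdict(list)
    for event in events:
        per_user[event["userId"]].append(event)

    features = {}
    for user_id, rows in per_user.items():
        # Track the first and last timestamp (in ms) seen in each session.
        bounds = {}
        for row in rows:
            first, last = bounds.get(row["sessionId"], (row["ts"], row["ts"]))
            bounds[row["sessionId"]] = (min(first, row["ts"]), max(last, row["ts"]))
        lengths = [last - first for first, last in bounds.values()]

        features[user_id] = {
            "num_sessions": len(bounds),
            "avg_session_length_ms": sum(lengths) / len(lengths),
            "num_artists": len({r["artist"] for r in rows if r["artist"] is not None}),
            "num_thumbs_up": sum(r["page"] == "Thumbs Up" for r in rows),
            "num_thumbs_down": sum(r["page"] == "Thumbs Down" for r in rows),
            # Assumed label rule: the user churned if they confirmed a cancellation.
            "churn": int(any(r["page"] == "Cancellation Confirmation" for r in rows)),
        }
    return features


user_features = build_user_features(events)
```

In Spark the same idea would typically be expressed as a group-by on the user identifier followed by aggregations, with the resulting columns assembled into a feature vector for the model.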

Modelling

The following models were trained on the features created from the dataset after cleaning and exploration:

  • Logistic Regression (LR)
  • Gradient-Boosted Trees (GBT)
  • Random Forest Classifier (RFC)

Of these, GBT performed best with an F1 score of 86%, followed by RFC at 83% and LR at 79%.
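
For readers unfamiliar with the metric, the sketch below shows how an F1 score is computed from prediction counts for a single class. The counts are made up for illustration and are not the notebook's results. This README does not state whether the reported scores are for the churn class alone or a class-weighted average, so treat this as the basic definition only.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1 score: the harmonic mean of precision and recall."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives

    # Guard against division by zero when a class is never predicted or never occurs.
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 churners correctly flagged, 2 false alarms, 2 churners missed.
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
print(f"F1 = {example_f1:.2f}")
```

F1 is a common choice for churn problems because churned users are usually a small minority, so plain accuracy can look high even when the model misses most of them.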

Results

The main findings are summarized in the accompanying blog post.

Licensing, Authors, and Acknowledgements

License: MIT