The motivation behind this project is to predict churn for the fictional music streaming platform Sparkify. This article provides a deep dive into the churn prediction task from both a business and a technical perspective. Using PySpark - a Python library for large-scale data processing - and Google Cloud Dataproc - a fully managed, highly scalable service for running Apache Spark applications in the cloud - we are going to predict churn based on Sparkify user logs.
A blog post about this project, including the technical configuration, is available on Medium.
From a business perspective, churn - which refers to customers who stop buying a company's product or using its service - can have a major negative impact on revenue. Companies are interested in using data to understand which customers are likely to churn and to prevent them from churning. This is why churn prediction has become a common task for data scientists and analysts in any customer-facing business.
The underlying dataset was provided by Udacity and originally covers 26,259,199 events from 22,278 Sparkify users, such as playing the next song, adding a song to a playlist, or upgrading or downgrading the service. In addition, there is a tiny subset (128MB) of the full Sparkify data with 286,500 events from 225 Sparkify users and a medium-sized subset (462MB) with 1,087,410 events from 448 Sparkify users.
Please note that we will not work with the original Sparkify dataset (12GB) in this project but with the medium-sized subset (462MB). This is because I wanted to implement the Sparkify pipeline on Google Cloud Platform (GCP), while the original dataset is stored on Amazon S3 at s3n://udacity-dsnd/sparkify/sparkify_event_data.json and the way it is configured does not allow it to be transferred to GCP easily. If it could be transferred, the pipeline could easily be scaled up and re-executed with the original dataset.
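To give an idea of what working with the event log looks like, below is a minimal PySpark sketch of loading the data. The bucket and file names are placeholders rather than the actual project paths; scaling up would only mean pointing the read at the full dataset instead of the subset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Placeholder path to the medium-sized subset in a Cloud Storage bucket
event_data = "gs://<your-bucket>/sparkify_event_data_medium.json"
# Re-running with the full dataset would only require changing this path, e.g.
# "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"

df = spark.read.json(event_data)
df.printSchema()
print(f"Number of events: {df.count()}")
```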
Churn prediction is a classification task. In this case, there are multiple possible metrics:
- Accuracy (= valid choice of evaluation metric for classification problems which are well-balanced and not skewed)
- Precision (= valid choice of evaluation metric when we want to be very sure of our prediction)
- Recall (= valid choice of evaluation metric when we want to capture as many positives as possible)
- F1 Score (= valid choice of evaluation metric when we want to have a model with both good precision and recall)
In this project, we will optimize for F1 Score since the churned users are a fairly small subset.
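To make this concrete, here is a small sketch of how the F1 Score can be computed in PySpark with MulticlassClassificationEvaluator; the toy label/prediction values are made up purely for illustration. Note that metricName="f1" computes a weighted F1 across both classes.

```python
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("SparkifyEval").getOrCreate()

# Toy (label, prediction) pairs: 1.0 = churned, 0.0 = active
preds = spark.createDataFrame(
    [(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)],
    ["label", "prediction"],
)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1"
)
print(f"F1 Score: {evaluator.evaluate(preds):.3f}")
```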
The project contains three steps:
The code for loading the Sparkify data is present in load_data.ipynb. First, it downloads the medium-sized Sparkify dataset from video.udacity-data.com into the local directory using the curl command below. Second, it loads the medium-sized dataset (462MB) and the mini dataset (128MB), which should be added manually to /data, into a Cloud Storage bucket.
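As a rough illustration of the second part, the sketch below uploads local data files to a Cloud Storage bucket using the google-cloud-storage client. The bucket and file names are placeholders, and the actual notebook (or data_transfer.py) may use a different upload mechanism.

```python
from google.cloud import storage


def upload_to_gcs(bucket_name: str, local_path: str, destination_name: str) -> None:
    """Upload a local file to the given Cloud Storage bucket."""
    client = storage.Client()  # picks up the default GCP credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_name)
    blob.upload_from_filename(local_path)


# Placeholder bucket and file names
upload_to_gcs("your-sparkify-bucket", "data/sparkify_event_data_medium.json",
              "sparkify_event_data_medium.json")
upload_to_gcs("your-sparkify-bucket", "data/mini_sparkify_event_data.json",
              "mini_sparkify_event_data.json")
```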
The code for the EDA is present in run_exploratory_data_analysis.ipynb. The necessary preprocessing steps identified in the EDA are stored in preprocessing_sparkify.py so they can be reused in the Spark Machine Learning Pipeline.
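As an example of what such a preprocessing step might look like, the sketch below flags churned users. It assumes churn is defined by a "Cancellation Confirmation" page event and that the log has userId and page columns; this is an illustrative assumption, not necessarily the exact logic in preprocessing_sparkify.py.

```python
from pyspark.sql import functions as F


def add_churn_label(df):
    """Add a per-user `label` column: 1 if the user ever cancelled, else 0."""
    churn_event = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
    churned_users = (
        df.withColumn("churn_event", churn_event)
          .groupBy("userId")
          .agg(F.max("churn_event").alias("label"))
    )
    return df.join(churned_users, on="userId", how="left")
```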
The code for the production Machine Learning Pipeline is present in run_pipe_sparkify.ipynb. Using the medium-sized dataset, the code takes about 3 hours to execute.
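For orientation, here is a minimal sketch of what such a pipeline can look like in Spark ML, tuned for F1 Score via cross-validation. The feature columns are hypothetical per-user aggregates, not the exact features engineered in this project.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Hypothetical per-user features (placeholders)
feature_cols = ["n_songs", "n_thumbs_down", "days_active", "avg_session_length"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, lr])
param_grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.01, 0.1]).build()
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
)

# `train` / `test` would be per-user DataFrames with the feature and label columns:
# model = cv.fit(train)
# f1 = evaluator.evaluate(model.transform(test))
```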
> git clone https://github.com/frederik-schmidt/Sparkify-churn.git
It is highly recommended to use virtual environments to install packages.
> conda create -n sparkify_churn python=3.8 jupyterlab
(where sparkify_churn is the environment name of your choice)
> conda activate sparkify_churn
(or whatever environment name you used)
> cd Sparkify-churn
> pip install -r requirements.txt
The mini Sparkify dataset (128MB) can be downloaded from the Internet. It should be placed in the /data directory because it gets loaded into a Cloud Storage bucket in load_data.ipynb.
When the packages have been installed, everything is set up and ready to run the project steps described in section 2.
├── data
├── data_transfer.py
├── LICENSE
├── load_data.ipynb
├── preprocessing_sparkify.py
├── README.md
├── requirements.txt
├── run_exploratory_data_analysis.ipynb
└── run_pipe_sparkify.ipynb
The project uses Python 3.8 and the following libraries (datetime, os, and time are part of the standard library):
- datetime
- numpy
- os
- pyspark
- time