
Distributed ML: Predicting Churn from Click Data with Apache Spark

Predicting Churn from user click-level data with distributed computation

  • Python 3.6.3
  • PySpark 2.3.2
  • Pandas 0.20.3
  • Matplotlib 3.0.2
  • seaborn 0.9.0


I am always eager to learn new frameworks and expand my capabilities, so when I heard about the possibility of a project utilizing Apache Spark and Hadoop, I was immediately intrigued. Having learned the basics of Apache Spark's PySpark API, I saw no better way to demonstrate machine learning skills than in a big data context. This project revolves around a key business issue that many firms face: how can we know which customers want to leave, and how can our marketing department target them? Business applications are what excite me most about data science. Gleaning valuable insights from corporate-sized data sources would show me that I can use "Big Data" as more than just a buzzword.

Repository Organization

│   ├── Sparkify.ipynb            # initial development & EDA on smaller subset of data
│   ├── Sparkify-Viz.ipynb        # visualization of data from EDA and AWS
│   ├── sparkify_full.ipynb       # AWS Implementation and final version
│   └── sparkify_full.html        # final version of notebook in html format
└── ...


The final model is a Random Forest classifier, chosen because it was the fastest model to train and to generate predictions: it took half the time of the next-fastest model. Even though it predicted on a dataset of reduced dimensionality, the classifier still performed exceptionally well. The dimensionality was reduced through principal component analysis, keeping components that together explain more than 97% of the variance in the dataset.
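
As a rough illustration of the variance threshold mentioned above, the sketch below picks the smallest number of principal components whose cumulative explained variance reaches a target. The function name `components_for_variance` and the example ratios are hypothetical and not taken from the notebooks; in PySpark, the ratios could come from the `explainedVariance` attribute of a fitted `PCAModel`.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold=0.97):
    """Return the smallest k whose first k ratios sum to at least `threshold`.

    `ratios` are per-component explained-variance ratios, largest first.
    Raises ValueError if the threshold is never reached.
    """
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    raise ValueError("threshold not reached; fit PCA with more components")


# Made-up ratios for illustration only (not results from this project).
example_ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625]
k = components_for_variance(example_ratios, threshold=0.97)
print(k)
```

A helper like this makes the trade-off explicit: raising the threshold keeps more components and more of the original signal, at the cost of a wider feature vector for the classifier.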

The most interesting part of examining the model's configuration is that it is a much simpler random forest than the out-of-the-box (OOTB) version. Instead of building 32 decision trees to classify each instance, it builds only 10 during training and lets those trees vote during prediction. The model made no mistakes on the validation or test datasets, which suggests that it is robust.
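
The voting step can be illustrated with a minimal, library-free sketch. It assumes hard majority voting, where each tree casts one vote for a class label; this is a simplification for illustration, and Spark's own aggregation may differ in detail. The `majority_vote` helper and the sample votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    Ties are broken in favor of the smallest label so the result is deterministic.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    counts = Counter(tree_predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


# Hypothetical votes from 10 trees for one user (1 = churn, 0 = stay).
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(majority_vote(votes))  # 7 of 10 trees vote 1
```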


Errors & Exceptions Encountered

I encountered various errors when using AWS's EMR managed Hadoop framework that I could only attribute to the backend rather than to my code. What complicated matters was that the AWS-hosted notebook crashed fairly frequently.

The errors are listed below for future reference:

