Churn Analysis

The goal of this project is to analyze a manually degraded variant of the Customer Churn dataset found on Kaggle.

Data preparation

See 0-DataPrep.ipynb

For the data prep I focused on identifying data type inconsistencies and missing values, and on getting a feel for the distributional differences of each variable between the training and testing datasets. Almost all discrepancies were found in the testing dataset.

I persisted the resulting cleaned-up data to Parquet files.
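
A minimal sketch of the kinds of checks involved (the file names, the offending column, and its repair below are assumptions for illustration, not the notebook's actual steps):

    import pandas as pd

    # Hypothetical file names for the raw Kaggle splits.
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Surface data type inconsistencies between the two splits.
    print(train.dtypes.compare(test.dtypes))

    # Quantify missing values per column in each split.
    print(train.isna().sum())
    print(test.isna().sum())

    # Example repair: coerce a numeric column that was read as strings,
    # turning unparseable entries into NaN for later handling.
    test["Total Spend"] = pd.to_numeric(test["Total Spend"], errors="coerce")

    # Eyeball distributional differences of each variable between splits.
    print(train.describe())
    print(test.describe())

    # Persist the cleaned-up frames as Parquet (needs pyarrow or fastparquet).
    train.to_parquet("train_clean.parquet")
    test.to_parquet("test_clean.parquet")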

Data exploration

See 1-ExploratoryDataAnalysis.ipynb

For this part my focus was to identify distributional associations between the Churn target variable and the other variables. I worked almost exclusively on the training set, and derived supplemental ordinal variables from the specific thresholds that I was able to identify visually.

This proved helpful to develop a first idea of potential drivers of Churn:

  • Age
  • Gender
  • Support Calls
  • Payment Delays
  • Total Spend

I persisted the resulting data to Parquet files.
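
As a sketch of that approach (the column names and cut points below are placeholders, not the thresholds actually identified in the notebook):

    import pandas as pd

    train = pd.read_parquet("train_clean.parquet")  # hypothetical file name

    # Distributional association: churn rate per category
    # (assuming Churn is encoded as 0/1).
    print(train.groupby("Gender")["Churn"].mean())

    # Derive a supplemental ordinal variable from visually identified
    # thresholds (the bin edges below are placeholders).
    train["Support Calls Band"] = pd.cut(
        train["Support Calls"],
        bins=[-1, 3, 6, float("inf")],
        labels=["low", "medium", "high"],
    )
    print(train.groupby("Support Calls Band", observed=True)["Churn"].mean())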

Modelling and retention campaign

See 2-PredictiveModel-NoBinning.ipynb

Following the preceding exploration, I was interested in producing an actual predictive model of Churn. The ambitions for that effort were:

  • to use a type of model that can be easily interpreted
  • to use a model that is relatively insensitive to outliers and does not require feature scaling
  • to have the ability to control the complexity of the model (in our case, lower is better)
  • to further identify important features, and how they rank
  • to suggest retention actions based on those interpretations

For those reasons, I decided to use random forests and decision trees. I also tried logistic regression, histogram-based gradient boosting, and a one-class Support Vector Machine, but none of them performed as well as the decision trees I trained (undocumented for the sake of brevity; please let me know if you want to see the notebook).
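
A minimal sketch of that setup with scikit-learn (the file names and hyperparameters are illustrative, not the values actually used):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical Parquet file names; "Churn" is the target column.
    train = pd.read_parquet("train_clean.parquet")
    test = pd.read_parquet("test_clean.parquet")
    X_train, y_train = train.drop(columns="Churn"), train["Churn"]
    X_test, y_test = test.drop(columns="Churn"), test["Churn"]

    # A depth-capped tree stays small enough to read and interpret, and
    # trees need neither outlier handling nor feature scaling.
    tree = DecisionTreeClassifier(max_depth=4, random_state=42)
    tree.fit(X_train, y_train)
    print(export_text(tree, feature_names=list(X_train.columns)))

    # A depth-capped random forest for comparison.
    forest = RandomForestClassifier(n_estimators=100, max_depth=4,
                                    random_state=42)
    forest.fit(X_train, y_train)
    print("test accuracy:", forest.score(X_test, y_test))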

One interesting outcome is that some of the features identified during the exploration phase carried over as important features in our models.

A few points to note:

  • I used the rfpimp package to provide measures of feature importance based on data permutations, as a potentially more robust approach than the default mean decrease in impurity (a.k.a. Gini importance) used by scikit-learn; see the sketch after this list.
  • I tried to use the ordinal variables I had created during the exploration phase, but eventually realized they did not add much value (again undocumented for the sake of brevity; please let me know if you want to see the notebook).
  • I used a random forest to train what is really a single decision tree, multiple times (for different random seeds). This is most likely not a problem, but with more time I would use scikit-learn's DecisionTreeClassifier directly instead, as sketched below.
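
A sketch combining both notes, assuming rfpimp's importances helper (file and column names are placeholders, and ideally the importances would be measured on a held-out validation split rather than the training data):

    import pandas as pd
    from rfpimp import importances  # pip install rfpimp
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_parquet("train_clean.parquet")  # hypothetical file name
    X, y = train.drop(columns="Churn"), train["Churn"]

    # n_estimators=1 with bootstrapping off and all features considered at
    # each split mimics a single decision tree, as described above.
    rf = RandomForestClassifier(
        n_estimators=1, bootstrap=False, max_features=None,
        max_depth=4, random_state=42,
    )
    rf.fit(X, y)

    # Permutation importance: shuffle one column at a time and measure the
    # drop in score, instead of relying on Gini importance.
    print(importances(rf, X, y))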