Customer Churn Prediction - Machine Learning

Problem Statement

This project predicts customer churn using machine learning. The objective is to evaluate several classical machine learning algorithms on how accurately they predict customer churn. It also compares the algorithms against one another and examines the effect of data refinement on their performance.

Keywords - Customer Churn, Classification, Prediction, Logistic Regression, Support Vector Machine, Naive Bayes Classification

Introduction and Methodology

Customer churn, also known as subscriber churn or logo churn, is the proportion of subscribers who terminate their subscriptions, commonly expressed as a percentage. Churn prediction and analysis is one of the most prominent and widespread applications of classical machine learning. Churn is a critical metric that reflects customer satisfaction at the macro scale. The telecom sector generally sees higher churn rates than other sectors, which creates a large-scale need for better prediction models.

To train the models, the following steps were carried out in order:

  • Data cleaning: We checked for duplicate and missing values and found the data to be complete and consistent.
  • Exploratory Data Analysis and Data Preprocessing: Conversion of categorical features to numerical features, trend analysis of each feature against the churn rate (y), and unit conversion where required.
  • Correlation: A correlation matrix was used to find linear relationships between pairs of variables.
  • Data preprocessing and encoding were completed.
  • Generalized Linear Model: Relationships between the predictor variables and the response variable were assessed based on p-values.
  • Feature Scaling: Used to bring the independent features onto a comparable scale.
  • Classification Models: The four models used are as follows:
  1. Binary Logistic Regression
  2. Support Vector Machine (SVM)
  3. Naive Bayes Classifier
  4. Random Forest Classifier
  • SMOTE was applied for data balancing.
  • Feature selection was performed on the basis of the correlation matrix and Principal Component Analysis (PCA).
  • The confusion matrix, accuracy, precision, F1 score, and recall were used for model analysis.
  • A Naive Bayes classifier implemented from scratch was also evaluated.
  • Logistic Regression was analysed by varying its parameters and specifications.
  • SVM was analysed by varying its parameters and choosing the optimal ones using grid search.
  • ROC-AUC curves were plotted for analysis.
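
The scaling and model-comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses a synthetic dataset from `make_classification` as a stand-in for the churn data, and picks hyperparameters such as `n_estimators=100` purely for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the churn data (assumption: the real
# dataset and its columns are not shown in this README).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The four classifier families named above, with illustrative settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear", C=1),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
for name, model in models.items():
    # The scaler is fitted on the training split only, so no test
    # statistics leak into training.
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```

Wrapping the scaler and the model in one pipeline keeps the preprocessing identical across all four models, which makes their scores easier to compare.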

Results

Before and After Data Balancing:

image
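
To illustrate the idea behind SMOTE, here is a simplified sketch in plain Python. It is not the imbalanced-learn implementation and not necessarily what the project ran; the function name `smote_sketch` and the toy points are hypothetical. Each synthetic sample is placed on the line segment between a minority point and one of its nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, nearest to `base` first.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 8 non-churners versus 3 churners.
majority = [(float(i), float(i % 3)) for i in range(8)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]

new_points = smote_sketch(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice, oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.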

Feature selection based on correlation and PCA: image
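
As a rough sketch of the two feature-selection approaches, the example below drops one feature from each highly correlated pair and then projects the data onto its leading principal components. It assumes NumPy, uses made-up features, and the 0.9 correlation cutoff is an illustrative choice rather than a value from the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical features: x1 nearly duplicates x0, x2 is independent noise.
x0 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

# Correlation-based selection: keep a feature only if it is not highly
# correlated with any feature that has already been kept.
corr = np.corrcoef(X, rowvar=False)
threshold = 0.9  # assumed cutoff, not taken from the project
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, i]) < threshold for i in keep):
        keep.append(j)
print("kept feature indices:", keep)

# PCA via SVD on the standardised features.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, components = np.linalg.svd(Z, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
X_pca = Z @ components[:2].T  # project onto the first two components
print("explained variance ratio:", np.round(explained_ratio, 3))
```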

The confusion matrix of models: image

image
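
For reference, the metrics used for model analysis can be derived from the four confusion-matrix counts as in the sketch below. The labels and predictions are invented for illustration (1 = churned, 0 = retained) and are not results from the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 5)
print(metrics)
```

On imbalanced churn data, precision, recall, and F1 are usually more informative than accuracy alone, because a model that always predicts "no churn" can still score a high accuracy.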

Analysis of Logistic Regression under different parameters and specifications: image
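
One of the specifications one might compare is logistic regression trained directly with a loss function and gradient descent. The sketch below shows that approach in plain Python under stated assumptions: the learning rate, epoch count, and single scaled feature are all hypothetical, and this is not the project's own implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def log_loss(w, b, X, y):
    """Mean binary cross-entropy; eps guards against log(0)."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the mean cross-entropy loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # dLoss/dz for one sample
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: one scaled feature, label 1 = churned.
X = [[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]]
y = [0, 0, 0, 1, 1, 1]

initial_loss = log_loss([0.0], 0.0, X, y)
w, b = fit_logistic(X, y)
final_loss = log_loss(w, b, X, y)
predictions = [1 if predict_proba(w, b, xi) >= 0.5 else 0 for xi in X]
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(predictions)
```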

Analysis of SVM under different parameters and the optimal parameters found using grid search:

image

image
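
A grid search over SVM parameters could look roughly like the sketch below. It assumes scikit-learn's `GridSearchCV` and `SVC`. The parameter grid, the 5-fold cross-validation, and the synthetic data are illustrative assumptions, not the values searched in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes below.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],  # only used by the rbf kernel here
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Because `gamma` is ignored by the linear kernel, linear candidates that differ only in `gamma` score identically, so the reported `gamma` for a linear winner carries no meaning.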

ROC-AUC Curve:

image

image

image
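
As a small worked example of how an ROC curve and its AUC are obtained from predicted scores, the sketch below computes both in plain Python. The labels and scores are invented for illustration and do not correspond to the plots above.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churned) and predicted churn scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = roc_auc(y_true, scores)
print(points)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a reference scale for comparing the models' curves.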

Conclusion

  • SMOTE was quite effective in our case, as the churn data was imbalanced.
  • The accuracy of Naive Bayes showed an increasing trend.
  • The accuracy of Logistic Regression showed a decreasing trend.
  • PCA had comparatively little impact on the final accuracy, F1 score, and precision values.
  • For logistic regression, the loss function with gradient descent approach worked better.
  • For the SVC model, the optimal parameters found were a linear kernel, C=1, and gamma=0.1 (note that gamma does not affect a linear kernel).
  • Random forest achieved the highest AUC among the models (AUC=0.69).

References