Analyze data associated with telecommunications customers in the San Francisco Bay area and predict churn using classification modeling.
San Bay Tel is a new telecommunications startup looking to establish themselves in the San Francisco Bay area.
San Bay Tel's main priority is retention while they establish themselves in the market. For this reason they have requested a model predicting the likelihood of customer churn.
Data associated with telecommunications customers from the San Francisco Bay area, covering area codes:
- 415 City of San Francisco, Marin County, and the northeast corner of San Mateo County
- 408 City of San Jose, Santa Clara County, and northern Santa Cruz County
- 510 Contra Costa County and western Alameda County
Data sourced from Churn in Telecom's dataset by david_becks
Remove phone number, area code, and state
- Individual phone numbers will be unique to each customer
- Only three area codes are represented, all within the same defined geographic area
- State indicates the customer's origin state; however, the current assumption is that they reside in the defined geographic area
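As a minimal sketch of this step, the columns can be dropped with pandas. The column names and the two sample rows below are assumptions for illustration; the actual dataset may name its columns differently.

```python
import pandas as pd

# Hypothetical two-row sample; column names and values are assumed, not taken from the source data.
df = pd.DataFrame({
    "state": ["CA", "OH"],
    "area code": [415, 408],
    "phone number": ["555-0101", "555-0102"],
    "total day minutes": [265.1, 161.6],
    "churn": [False, True],
})

# Drop the identifier and location columns described above.
df = df.drop(columns=["phone number", "area code", "state"])
print(list(df.columns))
```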
Define Categorical Data and Label Encode
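One way this could look with scikit-learn's `LabelEncoder` is sketched below. The column names and sample values are assumptions; the source does not list which columns were treated as categorical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; the real categorical columns may differ.
df = pd.DataFrame({
    "international plan": ["no", "yes", "no"],
    "voice mail plan": ["yes", "no", "no"],
    "churn": [False, True, False],
})

# Assumed list of string-valued categorical columns.
categorical_cols = ["international plan", "voice mail plan"]
for col in categorical_cols:
    # LabelEncoder sorts the classes, so "no" becomes 0 and "yes" becomes 1.
    df[col] = LabelEncoder().fit_transform(df[col])

# The boolean target only needs a cast to 0/1.
df["churn"] = df["churn"].astype(int)
```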
Total Churn: 483
Percentage of Churn: 14.49 %
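Figures of this kind can be computed from the target column as in the sketch below. The helper name `churn_summary` and the eight-value sample are hypothetical and are not the project's data.

```python
import pandas as pd

def churn_summary(churn: pd.Series) -> tuple:
    """Return (number of churned customers, churn percentage)."""
    total_churn = int(churn.sum())
    pct = 100 * total_churn / len(churn)
    return total_churn, pct

# Hypothetical sample: 2 churned customers out of 8.
sample = pd.Series([True, False, False, False, True, False, False, False])
total_churn, pct = churn_summary(sample)
print(f"Total Churn: {total_churn}")
print(f"Percentage of Churn: {pct:.2f} %")
```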
The following metrics are being calculated for each of the models tested:
- ROC Score, the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), a plot of True Positive Rate (TPR) versus False Positive Rate (FPR)
- Precision, the ratio of true positives to all predicted positives: the share of customers predicted to churn who actually churned
- Recall, the ratio of true positives to all actual positives: the share of customers who actually churned that the model identified
- F1 Score, the harmonic mean of the precision and recall values
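The four metrics can be computed with `sklearn.metrics`, as in this small sketch. The labels and scores are made-up values for illustration, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical labels (1 = churned) and predicted churn probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]

# Hard predictions at an assumed 0.5 threshold: 2 TP, 1 FP, 1 FN, 4 TN.
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "roc_auc": roc_auc_score(y_true, y_score),      # uses scores, not hard labels
    "precision": precision_score(y_true, y_pred),   # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),         # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                 # harmonic mean of the two
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that ROC AUC is computed from the predicted probabilities, while precision, recall, and F1 depend on the chosen threshold.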
Training and test metrics will be displayed side by side so that overfitting can be detected.
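A sketch of such a comparison follows. It uses synthetic data from `make_classification` and an unpruned decision tree purely as stand-ins; neither is the project's data or final model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 85% negative class).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An unpruned tree tends to memorize the training set, which makes the gap visible.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
gap = train_auc - test_auc

print(f"Train ROC: {train_auc:.4f}  Test ROC: {test_auc:.4f}  Gap: {gap:.4f}")
```

A large positive gap between training and test scores suggests the model is overfitting.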
The base metric used to evaluate the success of the models will be ROC Score, analyzing the predictive power through maximizing true positives and minimizing false positives.
San Bay Tel has requested a model predicting the likelihood of churn. Therefore, for the purposes of this project, the next most highly valued metric will be Precision, the proportion of customers predicted to churn who actually churned. F1 Score is considered only as a third (much lesser) metric for comparison between similarly performing models.
The following models were fitted:
- Logistic Regression Evaluation
- Ada Boost Evaluation
- K Neighbors Evaluation
- Decision Tree Evaluation
- Extra Tree Evaluation
- Gradient Boosting Evaluation
- Random Forest Evaluation
- XG Boost
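The comparison across model types can be sketched as a loop over scikit-learn estimators. Everything below is illustrative: the data is synthetic, hyperparameters are mostly defaults, and "Extra Tree" is assumed to mean `ExtraTreesClassifier`. XG Boost is left out so the sketch depends only on scikit-learn; `xgboost.XGBClassifier` exposes the same `fit` / `predict_proba` interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the churn dataset.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Ada Boost": AdaBoostClassifier(random_state=0),
    "K Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record its test ROC AUC.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, auc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {auc:.4f}")
```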
Based on the calculated metrics and their weighting under the business problem, XG Boost performed best of all the models tested:
- ROC Score: 91.80%
- Precision: 95.56%
- F1 Score: 89.58%
Feature Importance indicates the most significant features for predicting customer churn are those associated with minutes used by the customer, with the top three overall features being:
- Daytime Minutes
- Evening Minutes
- International Minutes
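A ranking of this kind can be read from a fitted tree ensemble's `feature_importances_` attribute. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XG Boost, with synthetic data constructed so that only one feature drives the label; the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: the label depends only on the first (hypothetically named) feature.
rng = np.random.default_rng(0)
features = ["day_minutes", "eve_minutes", "intl_minutes"]
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=features)
y = (X["day_minutes"] > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort so the strongest feature comes first.
importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
print(importances)
```

Impurity-based importances show which features the model relied on, not the direction of their effect, which is why the visualizations are still needed.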
The visualization for Churn by Daytime Minutes indicates that churn skews toward customers with a higher number of daytime minutes used.
This poses the question, is there a factor regarding the service which is negatively impacting customers who use more daytime minutes? Possible factors may be:
- General quality of service in the area
- Areas of lower quality
- Decreased quality at high use times during the day
The next most important feature not associated with minutes used (or with the closely related number of calls) is Customer Service Calls.
The visual for Churn by Customer Service Calls indicates that churn skews toward customers with more than three customer service calls.
Further investigation reveals 51.68% of customers with more than three customer service calls churn, versus 11.25% of customers with three or fewer.
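A comparison of churn rates on either side of the three-call threshold could be computed as sketched here. The eight rows and the column names are invented for illustration and do not reproduce the percentages above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names are assumed.
df = pd.DataFrame({
    "customer service calls": [0, 1, 2, 3, 4, 5, 6, 4],
    "churn": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Split customers at the threshold of three calls.
df["call_group"] = np.where(df["customer service calls"] > 3, "more than 3", "3 or fewer")

# The mean of a 0/1 column is the churn rate within each group.
churn_rate = df.groupby("call_group")["churn"].mean() * 100
print(churn_rate.round(2))
```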
This poses questions such as:
- What are the questions and/or issues customers are calling about?
- Can the individual questions and/or issues be resolved to a more satisfactory level (within reason of the overall business-model) so as to retain customers?
- Are there recurring questions and/or issues across customer service calls which can be addressed at a macro-level?
- Are customers satisfied with their interactions with customer service; what is customers' perception of the quality of customer service?
The next most important feature after Customer Service Calls is Number of Voicemail Messages.
The visual for Churn by Number of Voicemail Messages indicates a much higher churn rate amongst customers with zero voicemails. This is consistent with Churn by Voicemail Plan, which indicates a significantly higher rate of churn amongst customers without a voicemail plan (who would therefore have zero voicemails).
Further investigation reveals 83.44% of customers who churn do not have a voicemail plan.
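This figure is a share among churned customers, which is a different quantity from the churn rate among customers without a plan. The sketch below computes both on a made-up sample (column names and values are assumptions) to show the distinction.

```python
import pandas as pd

# Hypothetical sample; not the project's data.
df = pd.DataFrame({
    "voice mail plan": ["no", "no", "yes", "no", "yes", "no", "no", "no"],
    "churn": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Share of churned customers who have no voicemail plan.
churned = df[df["churn"] == 1]
share_no_plan_among_churned = (churned["voice mail plan"] == "no").mean() * 100

# Churn rate among customers who have no voicemail plan.
no_plan = df[df["voice mail plan"] == "no"]
churn_rate_among_no_plan = no_plan["churn"].mean() * 100

print(f"No plan among churned: {share_no_plan_among_churned:.2f} %")
print(f"Churn among no plan:   {churn_rate_among_no_plan:.2f} %")
```

Because most customers may lack a voicemail plan to begin with, the first number should be read alongside the second before drawing conclusions.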
This poses the questions:
- Are customers aware of the voicemail plan?
- Are sales associates educating customers about the voicemail plan and encouraging it?
Following the findings of this investigation, the next steps toward improving San Bay Tel's retention and limiting customer churn while they establish themselves in the market would be:
Further Modeling
The most important features toward predicting churn were all associated with minutes, which is a continuous, numerical data type.
Other modeling types beyond classification models may provide further insight toward better predicting churn.
Customer Polling
Many of the questions posed in the conclusions can be further explored through targeted polling of the customers most likely to churn.
Further analysis and modeling can then be done with the resulting data.
Investigation of Service Quality
An investigation into the quality of service may be warranted, because customers with a higher number of minutes used have a higher churn rate.
Rating factors such as connection strength by geographic location and connection strength by time of day can then be modeled against churn rate.
Logging of Customer Service Call Topic
Logging the topics of customer service calls may be warranted.
The resulting data can then be modeled against churn and other factors associated with the topic.