CHURN PREDICTION FINAL PROJECTS
PROGRESS
Goal: 7-page Research Paper + 2-page executive summary
1. Executive Summary: 2 pages
2. Abstract: 0.5 pages
3. Introduction: 1 page
4. Literature Review: 2 pages
5. Data and Methodologies: 1 page
6. Results and Discussions: 3 pages
7. Conclusion: 0.5 pages
8. References/Appendix: not counted
Deadline:
- 18 Aug: Final paper presentation
- 23 Aug: Final paper submission
Assigning Task + Timeline
Task | Sub-task | Description | Assignment | Date | Expected Output |
---|---|---|---|---|---|
Hypothesizing the problem | Research questions | Raise the problem and pose research questions that aim to contribute to up-to-date scientific studies, then build the models | Team | Aug 14 2021 | Several research questions and their contributions |
Literature review | Collecting and writing the literature review | Focus on literature that discusses our research problems, its pros and cons, and how our study overcomes, improves on, or differs from those studies. This runs in parallel with the data modeling team so that both can discuss these points together | 2 persons | Aug 16 2021 | Essay |
Exploratory data analysis | Univariate and multivariate | Combine our knowledge from this course to conduct analyses and derive valuable insights. Those insights should serve as potential inputs for the upcoming models | 2 persons | Aug 15 2021 | Essay |
Data mining | After EDA, create more useful variables and fit several models | Keep the code clean and clear | 2 persons from EDA | Aug 16 2021 | Model output |
Result and Discussion | Writing results and discussion | Write up the results from the model output | Team | Aug 17 2021 | Essay |
Conclusion and Introduction | Writing conclusion and introduction | Based on the above outputs | Team | Aug 17 2021 | Essay |
Executive summary and Abstract | Writing executive summary and abstract | Based on the above outputs | 1 person | Aug 17 2021 | Essay |
Prepare PPT file for presentation | Preparing the PPT and presenting | Team answers the questions from Prof. | 2 persons | Aug 18 2021 | 1 PPT file and presentation via Skype |
Document for submission | Writing paper | Adjust the paper based on comments from the presentation; finalize format, references, citations, etc. | Team | Aug 23 2021 | 1 PDF file |
Update Tool: GitHub for storing documents and code files (Python and R), which makes it easy to sync and track code changes.
Exploratory Data Analysis
Please see EDA.html for further details. Some highlights are as follows.
- Imbalanced dataset: 84% existing customers
- Among categorical variables, the percentage of Attrited Customers seems to be fairly equal across all categories of all the variables. `Gender` and `Income_Category` clearly contribute to discriminating power. Other categorical variables need further checking.
- Several continuous variables have a large number of outliers based on the IQR rule. Most of them follow non-Normal distributions. Several are markedly right-skewed. Some relationships are non-linear. `Total_Relationship_Count`, `Months_Inactive_12_mon`, `Contacts_Count_12_mon`, `Credit_Limit`, `Total_Revolving_Bal`, `Total_Amt_Chng_Q4_Q1`, `Total_Trans_Amt`, `Total_Trans_Ct`, `Total_Ct_Chng_Q4_Q1`, `Avg_Utilization_Ratio` are confirmed to have discriminating power.
- Drop `Avg_Open_To_Buy`
- Employ PCA for mixed data. Choosing 17 components for 71.5% of total variation. Needs further interpretation.
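The IQR rule mentioned in the highlights can be illustrated with a minimal sketch. It assumes the common 1.5 x IQR fences and uses only the Python standard library; the function name `iqr_outliers` and the sample values are hypothetical and are not taken from EDA.html or the project data.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical values for illustration only (not project data).
sample = [10, 12, 13, 13, 14, 15, 15, 16, 18, 95]
outliers = iqr_outliers(sample)
print(outliers)
```

Different quartile conventions (for example `method="exclusive"`) shift the fences slightly, so the count of flagged points can differ between tools.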
Data mining:
We employ logistic regression to describe several aspects of the data. We apply several techniques to handle the data and compare their performance in a logistic regression model against the benchmark. The techniques considered are as follows.
- Weight of Evidence transformation: aiming to handle non-linearity and outliers
- PCA transformation: aiming to handle multicollinearity
- SMOTE sampling: aiming to handle imbalance in the dataset
- Benchmark model: Logistic regression without the aforementioned techniques.
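As an illustration of the Weight of Evidence transformation listed above, here is a minimal standard-library sketch. It assumes the convention WOE = ln(share of non-events / share of events) per category, where an event is an attrited customer (target 1); the opposite sign convention is also in use. The function name `woe_table`, the additive `smoothing` term, and the toy data are assumptions made for illustration and are not taken from Modeling.html.

```python
import math
from collections import Counter

def woe_table(categories, targets, smoothing=0.5):
    """Map each category to ln(share of non-events / share of events)."""
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    levels = sorted(set(categories))
    # Additive smoothing (an assumption) avoids log(0) for empty cells.
    total_events = sum(events.values()) + smoothing * len(levels)
    total_non_events = sum(non_events.values()) + smoothing * len(levels)
    table = {}
    for level in levels:
        event_share = (events[level] + smoothing) / total_events
        non_event_share = (non_events[level] + smoothing) / total_non_events
        table[level] = math.log(non_event_share / event_share)
    return table

# Hypothetical binned variable and churn flags (1 = attrited).
categories = ["low"] * 4 + ["high"] * 4
targets = [1, 1, 1, 0, 0, 0, 0, 1]
table = woe_table(categories, targets)
# Replace each category by its WOE value before fitting the model.
transformed = [table[c] for c in categories]
print(table)
```

Continuous variables would first be binned; each bin then plays the role of a category, which is how the transformation dampens outliers and captures non-linear effects.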
The performance criterion is the area under the receiver operating characteristic curve (AUC). The AUC assesses the behavior of a classifier independently of class distribution, classification cutoff, and misclassification costs (10.1016/j.ejor.2011.09.031).
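To make the AUC concrete, the sketch below uses its pairwise interpretation: the probability that a randomly chosen positive (attrited) customer receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the scores are hypothetical; in practice a library routine would normally be used instead of this quadratic loop.

```python
def auc(labels, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties = 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = attrited) and model scores.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
base_auc = auc(labels, scores)

# Duplicating every negative changes the class balance but not the AUC.
more_negatives_auc = auc(labels + [0, 0, 0, 0], scores + [0.5, 0.3, 0.2, 0.1])
print(base_auc, more_negatives_auc)
```

No cutoff appears anywhere in the computation, which is why the AUC is a convenient criterion for comparing models before a cutoff is chosen.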
The model can make two kinds of wrong predictions:
- Predicting that a customer will cancel their Credit Card services but they don't: False Positive
- Predicting that a customer won't cancel their Credit Card services but they do: False Negative
The bank's objective is to identify all potential customers who wish to close their Credit Card services. Predicting that customers won't cancel their Card services when they do end up attriting leads to a loss. Hence False Negatives must be reduced, and Recall is the metric to optimize when choosing the best cutoff probability in the final model. Recall must be maximized to lower the chance of False Negatives.
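The sketch below shows one way to count the two error types and pick a cutoff with Recall in mind. Maximizing Recall alone would simply push the cutoff to zero, so the sketch assumes a target Recall and returns the highest candidate cutoff that reaches it. That selection rule, the function names, and the data are assumptions made for illustration, not the procedure used in Modeling.html.

```python
def confusion_counts(labels, probabilities, cutoff):
    """Return (TP, FP, FN, TN) when predicting churn for p >= cutoff."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1  # predicted to cancel, but stays
        elif predicted == 0 and y == 1:
            fn += 1  # predicted to stay, but cancels
        else:
            tn += 1
    return tp, fp, fn, tn

def recall(labels, probabilities, cutoff):
    tp, _, fn, _ = confusion_counts(labels, probabilities, cutoff)
    return tp / (tp + fn) if tp + fn else 0.0

def highest_cutoff_for_recall(labels, probabilities, target_recall, candidates):
    """Largest candidate cutoff whose Recall meets the target, else None."""
    feasible = [c for c in candidates
                if recall(labels, probabilities, c) >= target_recall]
    return max(feasible) if feasible else None

# Hypothetical labels (1 = attrited) and predicted churn probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

default_counts = confusion_counts(labels, probabilities, 0.5)
best_cutoff = highest_cutoff_for_recall(labels, probabilities, 1.0, candidates)
print(default_counts, best_cutoff)
```

Lowering the cutoff removes False Negatives at the price of more False Positives, so the target Recall is ultimately a business decision.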
Please see Modeling.html for further details. Some highlights are as follows.
- SMOTE does improve the benchmark performance in terms of both AUC and Recall.
- WOE improves the benchmark performance in terms of only Recall.
- PCA reduces the benchmark performance in terms of both AUC and Recall.
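For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified standard-library sketch of that idea under stated assumptions (random choice of base points, squared Euclidean distance, numeric features only). The function name `smote_sketch` and the points are hypothetical, and this is not the implementation used in the project.

```python
import random

def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points on segments between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        others = [p for i, p in enumerate(minority) if i != base_index]
        # Keep the k nearest minority neighbours of the base point.
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class (attrited) points with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_sketch(minority, n_synthetic=6)
print(new_points)
```

Oversampling of this kind should be applied to the training split only, so that the evaluation data keep the original class balance.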