CHURN PREDICTION FINAL PROJECTS

PROGRESS

Goal: 7-page Research Paper + 2-page executive summary

1. Executive Summary: 2 pages
2. Abstract: 0.5 pages
3. Introduction: 1 page
4. Literature Review: 2 pages
5. Data and Methodologies: 1 page
6. Results and Discussions: 3 pages
7. Conclusion: 0.5 pages
8. References/Appendix: not counted toward the page limit

Deadline:

- 18 Aug: Final paper presentation
- 23 Aug: Final paper submission

Assigning Task + Timeline

Task	Sub-task	Description	Assignment	Date	Expected Output
Hypothesizing the problem	Research questions	Raise the problem and formulate research questions that aim to contribute to up-to-date scientific studies, then build the models	Team	Aug 14 2021	Several research questions and their contribution
Literature review	Collecting and writing the literature review	Focus on the literature that discusses our research problems and on its pros and cons. Work in parallel with the data modeling team to discuss how our study overcomes, improves on, or differs from those studies	2 persons	Aug 16 2021	Essay
Exploratory data analysis	Univariate and multivariate	Combine our knowledge from this course to conduct analyses and find valuable insights. Those insights should serve as potential inputs for the upcoming models	2 persons	Aug 15 2021	Essay
Data mining	After EDA, create more useful variables and fit several models	Keep the code clean and clear	2 persons from EDA	Aug 16 2021	Output of model
Result and Discussion	Writing results and discussion	Write up the results from the model output	Team	Aug 17 2021	Essay
Conclusion and Introduction	Writing conclusion and introduction	Based on the above outputs	Team	Aug 17 2021	Essay
Executive summary and Abstract	Writing executive summary and abstract	Based on the above outputs	1 person	Aug 17 2021	Essay
Prepare PPT file for presentation	Preparing the PPT and presenting		2 persons present; the team answers questions from the professor	Aug 18 2021	1 PPT file and presentation via Skype
Document for submission	Writing paper	Revise the paper based on comments from the presentation and finalize format, references, citations, etc.	Team	Aug 23 2021	1 PDF file

Update Tool: GitHub, for storing documents and code files (Python and R), which makes it easy to sync and to track code changes.

Exploratory Data Analysis

Please see EDA.html for further details. Some highlights as follows.

Imbalanced dataset: 84% existing customers
Among categorical variables, the percentage of Attrited Customers appears fairly equal across all categories of all the variables. Even so, Gender and Income_Category clearly contribute discriminating power. The other categorical variables need further checking.
Several continuous variables have a large number of outliers under the IQR rule. Most of them are not normally distributed, and several are markedly right-skewed. Some relationships are non-linear. Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Revolving_Bal, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio are confirmed to have discriminating power.
Drop Avg_Open_To_Buy
Employ PCA for mixed data, keeping 17 components that explain 71.5% of total variation. The components need further interpretation.
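
As an illustration of the IQR rule mentioned in the highlights, here is a minimal Python sketch. The function name `iqr_outliers`, the multiplier `k=1.5`, the "inclusive" quartile convention, and the sample numbers are all assumptions for illustration; EDA.html may use a different quartile definition.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" is one of several quartile conventions (an assumption);
    # other conventions give slightly different fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up numbers, not taken from the project dataset.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
outliers = iqr_outliers(sample)
print(outliers)  # [100]
```

Applied per column, a count of flagged values like this is one way to arrive at a "large number of outliers" judgment for a variable.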

Data mining:

We employ logistic regression to describe several aspects of the data. We will apply several data-handling techniques and compare their performance in the logistic regression model against the benchmark. The considered techniques are as follows.

Weight of Evidence transformation: aiming to handle non-linearity and outliers
PCA transformation: aiming to handle multicollinearity
SMOTE sampling: aiming to handle imbalance in the dataset
Benchmark model: Logistic regression without aforementioned techniques.
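
To make the Weight of Evidence idea from the list above concrete, the following is a minimal sketch for a single categorical variable. The function name `weight_of_evidence`, the sign convention (non-events over events), the additive `smoothing`, and the toy data are assumptions; Modeling.html may define WOE differently.

```python
import math
from collections import Counter

def weight_of_evidence(categories, target, smoothing=0.5):
    """Map each category to ln(share of non-events / share of events).

    target uses 1 for the event (attrited) and 0 for the non-event (existing).
    The sign convention and the additive smoothing are assumptions; some
    references define WOE with the ratio inverted.
    """
    events = Counter(c for c, t in zip(categories, target) if t == 1)
    non_events = Counter(c for c, t in zip(categories, target) if t == 0)
    levels = sorted(set(categories))
    # Smoothing keeps the logarithm finite when a level has no events or no non-events.
    total_events = sum(events.values()) + smoothing * len(levels)
    total_non_events = sum(non_events.values()) + smoothing * len(levels)
    woe = {}
    for level in levels:
        share_events = (events[level] + smoothing) / total_events
        share_non_events = (non_events[level] + smoothing) / total_non_events
        woe[level] = math.log(share_non_events / share_events)
    return woe

# Toy data, not taken from the project dataset.
gender = ["F", "F", "F", "M", "M", "M"]
attrited = [1, 0, 0, 1, 1, 0]
woe_map = weight_of_evidence(gender, attrited, smoothing=0.0)
print(woe_map)
```

Each raw category is then replaced by its WOE value before fitting the logistic regression. A continuous variable would first be binned, which is how the transformation can absorb outliers and non-linear effects.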

The performance criterion is the area under the receiver operating characteristic curve (AUC). The AUC assesses the behavior of a classifier regardless of class distribution, classification cutoff, and misclassification costs (10.1016/j.ejor.2011.09.031).
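
One way to see why the AUC does not depend on a cutoff is its rank interpretation: it equals the probability that a randomly chosen attrited customer receives a higher score than a randomly chosen existing customer. The sketch below computes it that way; the function name `auc_score` and the toy scores are assumptions for illustration, and a library routine would normally be used instead of this quadratic loop.

```python
def auc_score(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not taken from the project dataset.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
print(auc)  # 0.75
```

Only the ordering of the scores enters the computation, so no classification threshold appears anywhere in it.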

The model can make two kinds of wrong predictions:

Predicting that a customer will cancel their credit card services when they do not: False Positive
Predicting that a customer won't cancel their credit card services when they do: False Negative
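
The two error types can be counted directly from predictions. The sketch below assumes the label 1 means "attrited" (the positive class) and 0 means "existing"; the function name `confusion_counts` and the toy vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN with 1 = attrited (positive), 0 = existing."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1  # predicted to cancel, but stayed
        elif predicted == 0 and actual == 1:
            counts["FN"] += 1  # predicted to stay, but cancelled
        else:
            counts["TN"] += 1
    return counts

# Toy vectors, not taken from the project dataset.
actual = [1, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)
```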

The bank's objective is to identify all customers who may wish to close their credit card services. Predicting that customers won't cancel their card services when they do end up attriting leads to a loss. Hence false negatives must be reduced. Recall is therefore the metric to optimize in the final model when choosing the best cutoff probability: it must be maximized to lower the chance of false negatives.
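
Recall is TP / (TP + FN), so lowering the cutoff can only raise it, and a cutoff of 0 trivially gives a recall of 1. A practical selection rule therefore needs a second constraint. The sketch below shows one possible rule, which is an assumption and not necessarily the one used in Modeling.html: pick the highest cutoff on a grid that still reaches a target recall. The function names, the grid, and the toy probabilities are hypothetical.

```python
def recall_at(y_true, y_prob, cutoff):
    """Recall = TP / (TP + FN) when predicting 1 for probabilities >= cutoff."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= cutoff)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < cutoff)
    return tp / (tp + fn)

def highest_cutoff_for_recall(y_true, y_prob, target_recall, grid):
    """Return the largest cutoff in grid whose recall meets the target, or None."""
    feasible = [c for c in grid if recall_at(y_true, y_prob, c) >= target_recall]
    return max(feasible) if feasible else None

# Toy labels and predicted probabilities, not taken from the project dataset.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1]
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best_cutoff = highest_cutoff_for_recall(labels, probs, target_recall=1.0, grid=grid)
print(best_cutoff)  # 0.3
```

Choosing the highest feasible cutoff keeps the number of false positives as low as the recall target allows.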

Please see Modeling.html for further details. Some highlights are as follows.

SMOTE does improve the benchmark performance in terms of both AUC and Recall.
WOE improves the benchmark performance in terms of only Recall.
PCA reduces the benchmark performance in terms of both AUC and Recall.
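
For readers unfamiliar with the SMOTE step compared above, the core idea is to create synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified, pure-Python illustration and not the implementation used in Modeling.html; the function name `smote_like`, the parameters `k` and `seed`, and the toy points are assumptions.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest minority neighbours by squared Euclidean distance.
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic

# Toy minority points, not taken from the project dataset.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)
print(new_points)
```

Oversampling of this kind is usually applied to the training split only, so that the evaluation data keeps the original class distribution.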

HieuNguyenPhi/DSC424-Final-Project
