portfolio project
- dataset
- requirement file
- bank_churn_final_cycle
- pre-processing experiment file
- pickle file
- api file
- data preparation file
- business problem identification
- creation and validation of hypotheses
- data cleaning and feature engineering, if necessary
- feature selection
- testing machine learning models with cross validation
- hyperparameter tuning
- metrics
- model in production: it receives a customer base via API and returns the same base “scored”, that is, with an extra column holding each customer's probability of churn.
- What is TopBank's current Churn rate?
- How well does the model classify customers as churn?
- What is the expected return, in terms of revenue, if the company uses the model to prevent customer churn?
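The first question above comes down to a simple ratio. A minimal sketch in plain Python, assuming each customer record carries a binary churn flag; the field name `Exited` and the toy records are hypothetical and are not taken from the project's dataset:

```python
def churn_rate(customers, flag="Exited"):
    """Return the share of customers whose churn flag equals 1."""
    if not customers:
        raise ValueError("customer base is empty")
    churned = sum(1 for customer in customers if customer[flag] == 1)
    return churned / len(customers)


# Hypothetical toy base: 2 of the 5 customers churned.
sample_base = [
    {"CustomerId": 1, "Exited": 0},
    {"CustomerId": 2, "Exited": 1},
    {"CustomerId": 3, "Exited": 0},
    {"CustomerId": 4, "Exited": 1},
    {"CustomerId": 5, "Exited": 0},
]

rate = churn_rate(sample_base)
print(f"churn rate: {rate:.1%}")
```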
features that were removed to improve the performance of the model
Feature selection techniques were used to check whether any combination of features would make the model perform better. The file with the procedure is in the link.
- RowNumber
- CustomerId
- Surname
pre-processing techniques used in this process
BINARIZING SOME CATEGORICAL COLUMNS
- Gender
- Geography
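The two preparation steps described so far (dropping the identifier columns and binarizing `Gender` and `Geography`) could look roughly like the sketch below. It assumes pandas and uses `pd.get_dummies` for the encoding; the README does not say which encoder the project actually used. The sample rows, the `Balance` column and the category values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names RowNumber, CustomerId, Surname,
# Gender and Geography come from the text above.
raw = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["Silva", "Costa", "Lima"],
    "Gender": ["Female", "Male", "Female"],
    "Geography": ["France", "Spain", "Germany"],
    "Balance": [0.0, 1200.5, 830.0],
})

ID_COLUMNS = ["RowNumber", "CustomerId", "Surname"]
CATEGORICAL_COLUMNS = ["Gender", "Geography"]


def prepare(df):
    """Drop identifier columns, then expand categoricals into 0/1 columns."""
    features = df.drop(columns=ID_COLUMNS)
    return pd.get_dummies(features, columns=CATEGORICAL_COLUMNS, dtype=int)


prepared = prepare(raw)
print(prepared.columns.tolist())
```

Identifier columns carry no signal about churn, so removing them before encoding keeps the feature matrix small.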
- KNeighborsClassifier
- DecisionTreeClassifier
- GaussianNB
- LogisticRegression
- GaussianNB
- SGDClassifier
- RandomForestClassifier
TRAINING AND TESTING THE MODEL WITH CROSS VALIDATION TO SEE WHICH IS THE BEST ONE
KNeighborsClassifier() score: 0.6881203380961249
DecisionTreeClassifier() score: 0.7870822603994517
GaussianNB() score: 0.720647455853267
LogisticRegression() score: 0.6742443638327416
GaussianNB() score: 0.720647455853267
SGDClassifier() score: 0.48517883873816076
RandomForestClassifier() score: 0.858099377082428
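A comparison like the one above can be reproduced with a short loop over the candidate models. This is a sketch under stated assumptions, not the project's actual script: the data is synthetic (`make_classification`), and both the fold count (`cv=5`) and the default accuracy scoring are guesses, since the README does not say how its scores were computed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared churn data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
    SGDClassifier(random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

scores = {}
for model in models:
    # Mean score over the folds; cv=5 is an assumed fold count.
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[type(model).__name__] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name} score: {score:.4f}")

best_model = max(scores, key=scores.get)
print("best:", best_model)
```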
Hyperparameter tuning was used to try to improve the performance of the model
BEST PARAMETERS {'criterion': 'entropy', 'max_features': 'auto', 'n_estimators': 300}
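A search that yields a result of this shape might be set up as in the sketch below, assuming scikit-learn's `GridSearchCV` over a `RandomForestClassifier`. The grid values, the fold count and the synthetic data are hypothetical. Note that newer scikit-learn releases reject `'auto'` for `max_features`; for a classifier it behaved like `'sqrt'`, which is what the sketch uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared churn data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid, kept small so the sketch runs quickly.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "n_estimators": [20, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,  # assumed fold count
)
search.fit(X, y)

print("BEST PARAMETERS", search.best_params_)
print("best cross-validated score:", round(search.best_score_, 4))
```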
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| NO CHURN | 0.88 | 0.85 | 0.86 | 1633 |
| CHURN | 0.85 | 0.87 | 0.86 | 1553 |
| accuracy | | | 0.86 | 3186 |
| macro avg | 0.86 | 0.86 | 0.86 | 3186 |
| weighted avg | 0.86 | 0.86 | 0.86 | 3186 |
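As a sanity check on the report above, the f1-score of each class can be recomputed from its precision and recall, since F1 is their harmonic mean. The sketch below works from the rounded two-decimal values in the report, so it can only confirm agreement at that precision.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, as reported above (already rounded).
report = {
    "NO CHURN": (0.88, 0.85),
    "CHURN": (0.85, 0.87),
}

f1_by_class = {label: f1_score(p, r) for label, (p, r) in report.items()}
for label, f1 in f1_by_class.items():
    print(f"{label}: f1 = {f1:.2f}")
```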
- the API file for classification is in the directory
- the API must be running before data is sent for classification
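The scoring step that such an API wraps can be sketched independently of any web framework. The example below is an illustration only: `ConstantModel` is a hypothetical stand-in for the trained classifier stored in the pickle file, and the names `score_base`, `churn_probability`, `Age` and `Balance` are assumptions rather than names taken from the project's API file.

```python
class ConstantModel:
    """Hypothetical stand-in for the trained classifier from the pickle file."""

    def predict_proba(self, rows):
        # Mimics scikit-learn's output shape: [p_no_churn, p_churn] per row.
        return [[0.7, 0.3] for _ in rows]


def score_base(customers, model, feature_names):
    """Return a copy of the base with an extra churn-probability column."""
    rows = [[customer[name] for name in feature_names] for customer in customers]
    probabilities = model.predict_proba(rows)
    return [
        {**customer, "churn_probability": proba[1]}
        for customer, proba in zip(customers, probabilities)
    ]


# Hypothetical customer base as it might arrive in a request body.
base = [
    {"CustomerId": 1, "Age": 42, "Balance": 0.0},
    {"CustomerId": 2, "Age": 35, "Balance": 1500.0},
]

scored = score_base(base, ConstantModel(), feature_names=["Age", "Balance"])
print(scored[0])
```

Returning new records instead of mutating the input keeps the received base intact, which makes the endpoint easier to test.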
amount the bank loses annually to churn among customers with above-average salaries: 31.373,947
amount the bank loses annually to churn among customers with below-average salaries: 7.472,377
amount the bank loses annually to churn across all customers: 38.846,324
HOW MUCH THIS WILL DECREASE WITH THE USE OF THE MODEL (a 5% decrease, which is the difference between the model's accuracy and the current % of churn): 1.942,316
HOW MUCH THE BANK WILL CONTINUE TO LOSE WITH THE MODEL: 36.904,008 (STILL STUDYING HOW TO IMPROVE THE MODEL AND REDUCE THIS LOSS)
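The arithmetic behind these figures can be retraced in a few lines. The sketch assumes the amounts use "." as the thousands separator and "," as the decimal separator, which is an interpretation and is not stated in the text; the 5% reduction is taken as given.

```python
# Figures from the text, read with "." as thousands separator (assumption).
loss_above_average = 31373.947  # customers with above-average salaries
loss_below_average = 7472.377   # customers with below-average salaries
expected_reduction = 0.05       # the 5% decrease stated in the text

total_loss = loss_above_average + loss_below_average
avoided_loss = total_loss * expected_reduction
remaining_loss = total_loss - avoided_loss

print(f"total annual loss:   {total_loss:,.3f}")
print(f"avoided with model:  {avoided_loss:,.3f}")
print(f"remaining with model: {remaining_loss:,.3f}")
```

Under this reading the three results match the totals quoted above to three decimal places.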