Bank-Customer-Prediction: A Jupyter Notebook repository from reyrobs

Bank Customer Prediction

Table of Contents

General idea and approach
Data Exploration
Results with imbalanced dataset
Results with balanced dataset
Interpretation of results
License
Contact
References

General idea and approach

The purpose of this project is to explore the best way to classify bank customers according to whether or not they plan to leave the bank. Each customer is described by a set of features such as age, salary and location. A major issue with this dataset is that it is imbalanced, so we explore several ways to tackle this problem, which are discussed below. The classifier chosen for this task is a simple ANN with 4 layers and an input layer of 12 neurons, since each customer is represented by 12 features.
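
To make the shape of such a network concrete, here is a minimal forward-pass sketch in plain Python. Only the 12 inputs and the count of 4 layers come from the description above. The hidden layer sizes, the ReLU and sigmoid activations, the random weights and all function names are assumptions for illustration, and the notebook's actual model may differ.

```python
import math
import random

# Hypothetical layer sizes: 12 inputs, two hidden layers of 8 units, 1 output.
LAYER_SIZES = [12, 8, 8, 1]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for a layer with n_out units."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(seed=0):
    """Create untrained weights for every pair of consecutive layers."""
    rng = random.Random(seed)
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def predict_proba(network, features):
    """Forward pass returning the predicted probability of the positive class."""
    activations = features
    for index, (weights, biases) in enumerate(network):
        is_output = index == len(network) - 1
        activations = dense(activations, weights, biases,
                            sigmoid if is_output else relu)
    return activations[0]


network = build_network()
customer = [0.1] * 12  # one customer with 12 already-scaled feature values
probability = predict_proba(network, customer)
```

The weights here are untrained, so the output is not meaningful as a prediction. The sketch only shows how 12 feature values flow through the layers to a single probability.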


Data Exploration

The dataset used can be found on Kaggle through this link [1]. The original data consists of 10000 entries with 14 columns each. We only make use of 11 of them, since we do not consider the others useful for our classification. Five samples from the dataset are shown below.

It is worth noting that the dataset is imbalanced, a problem which we tackle further on. The histogram below shows the imbalance between the two classes.

We carried out further exploratory data analysis to look for information in the dataset that would help with our classification. Given that the data is imbalanced, such information was hard to find. The results of this analysis can be found below.

The main information that we can take from this analysis is the spread of customers across different countries, since the class imbalance does not affect this information.


Results with imbalanced dataset

The results obtained on this dataset are not as good as they seem. The overall accuracy looks rather good. However, since the data is imbalanced, the precision, recall and f1-score for label 1 are rather poor. The classifier has developed a tendency to predict label 0, since it represents a larger proportion of the labels.
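
The effect can be reproduced with a small sketch. The 80/20 class split and the always-predict-0 classifier below are hypothetical and are not the notebook's data or model. They only illustrate how accuracy can look good while the minority class metrics collapse.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 80 samples of the majority class (0), 20 of the minority (1).
y_true = [0] * 80 + [1] * 20
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
majority = per_class_metrics(y_true, y_pred, label=0)
minority = per_class_metrics(y_true, y_pred, label=1)

print(accuracy)   # 0.8, which looks respectable on its own
print(majority)   # precision 0.8, recall 1.0, f1 about 0.89
print(minority)   # precision, recall and f1 are all 0.0
```

This is why the per-class precision, recall and f1-score are reported alongside accuracy throughout the rest of this README.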


Results with balanced dataset

Method 1: Undersampling

Our first method to combat the imbalanced dataset gives better results for the metrics of class 1. Although the overall accuracy has decreased, it is now a better reflection of the classifier's true performance, since the dataset is balanced. The method is rather simple: we remove samples from the overrepresented label until both classes are the same size. The downside is that we throw away data that could otherwise be used for our classification.
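
Random undersampling as described above can be sketched in a few lines of standard-library Python. The function name `random_undersample` and the toy data are hypothetical, and the notebook may rely on a library implementation instead.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is cut down to the size of the rarest class.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for label, group in by_class.items():
        kept = rng.sample(group, target)  # sampling without replacement
        balanced.extend((sample, label) for sample in kept)

    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return [s for s, _ in balanced], [l for _, l in balanced]


# Toy data: 8 samples of class 0 and 2 samples of class 1.
samples = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
X_under, y_under = random_undersample(samples, labels)
```

In this toy case 6 of the 8 majority samples are discarded, which makes the data loss mentioned above easy to see.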

Method 2: Oversampling

The second method is oversampling, which is very similar to the first except that we increase the number of samples of the underrepresented class by duplicating existing samples. We obtain good metrics for each class as well as a solid accuracy.
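
A comparable sketch for random oversampling is shown here. As before, the function name `random_oversample` and the toy data are hypothetical rather than taken from the notebook.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is grown to the size of the most common class.
    target = max(len(group) for group in by_class.values())

    balanced = []
    for label, group in by_class.items():
        # Draw duplicates with replacement from the existing samples.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((sample, label) for sample in group + extra)

    rng.shuffle(balanced)
    return [s for s, _ in balanced], [l for _, l in balanced]


# Toy data: 8 samples of class 0 and 2 samples of class 1.
samples = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
X_over, y_over = random_oversample(samples, labels)
```

One design point worth keeping in mind: duplicates are usually created from the training split only, since copies of the same sample appearing in both training and test data would inflate the test metrics.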

Method 3: SMOTE

Our final method is SMOTE (Synthetic Minority Oversampling Technique), which creates artificial samples for the underrepresented class so that the dataset becomes balanced. This method yielded the best results, both for the per-class metrics and for the overall accuracy.
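
The core idea behind SMOTE is to interpolate between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea and not a full implementation. The function name `smote_sketch`, the default `k` and the toy points are assumptions, and it requires Python 3.8 or later for `math.dist`.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with three 2-D points.
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_points = smote_sketch(minority, n_new=5, k=2)
```

Unlike plain oversampling, the new points are not exact copies, which gives the classifier a slightly more varied picture of the minority class.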


Confusion matrix for best method

Interpretation of results

Over the course of this small project we have seen the effect of using an imbalanced dataset. It can give a misleading accuracy, since the classifier develops a tendency to predict the overrepresented class. To tackle this, we used 3 methods and found that the best one in this case was SMOTE.
