Machine learning models allow us to tackle classification problems. Take this dataset as an example: machine learning helps us determine whether a transaction is legitimate or fraudulent. Since most transactions are not fraudulent, handling imbalanced data is the main challenge of this analysis. Therefore, our main goal is to build a model that can correctly identify the type of a transaction, even though the dataset is imbalanced.
This is one of the classic imbalanced datasets on Kaggle. It contains two days of credit card transactions. The feature names are not disclosed for confidentiality reasons.
1.Data Preprocessing
1.1 Null Values
1.2 Feature Scaling
1.3 Feature Selection
2.Model Selection & Performance
2.1 Before oversampling or undersampling
2.2 Oversampling
2.3 Undersampling
2.4 SMOTE
My findings:
1.There are 30 predictor variables and 1 target variable with 284807 rows.
2.There are no null values in this data set.
3.Columns 'Time' and 'Amount' are not scaled.
4.The data set is highly imbalanced: there are only 492 frauds out of 284807 transactions, so frauds account for about 0.17% of the data.
5.The distribution of Amount is heavily right-skewed, with a mean of about 88, and a few transactions have much larger amounts. (A sketch of these checks follows the list.)
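The findings above can be reproduced with a few quick pandas checks. This is a minimal sketch, assuming the Kaggle file is saved as creditcard.csv (the path and variable names are my own):

```python
import pandas as pd

# Load the Kaggle credit card fraud data set (file path is an assumption).
df = pd.read_csv("creditcard.csv")

print(df.shape)                    # (284807, 31): 30 predictors + the 'Class' target
print(df.isnull().sum().sum())     # 0 -> no null values
print(df["Class"].value_counts())  # 492 frauds (Class == 1) vs. the rest
print(df["Amount"].describe())     # right-skewed, mean around 88
```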
Time and Amount are not scaled, so I apply standardization to both columns.
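A minimal sketch of this step with scikit-learn's StandardScaler, reusing the df frame from the sketch above:

```python
from sklearn.preprocessing import StandardScaler

# Standardize only the unscaled columns; the anonymized features V1-V28 are left as-is.
scaler = StandardScaler()
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])
```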
There are 30 predictor variables in this data set. To reduce the computational cost of modeling, feature selection helps us extract the more informative variables. Since the input variables are numerical and the output variable is categorical, the ANOVA F-test is used to select the top 10 variables. After applying this feature selection method, V17, V14, V12, V10, V16, V3, V7, and V11 appear to be the informative variables.
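One way to run this selection is scikit-learn's SelectKBest with the ANOVA F-score (f_classif); the sketch below assumes the df frame defined earlier:

```python
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns="Class")
y = df["Class"]

# Score each numerical feature against the binary target with the ANOVA F-test
# and keep the ten highest-scoring columns.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(list(X.columns[selector.get_support()]))  # names of the selected features
```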
In this analysis, I use the informative variables to build a decision tree model, with Gini impurity as the criterion for measuring the quality of each split.
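A sketch of the model with scikit-learn's DecisionTreeClassifier; the train/test split ratio and random_state are assumptions, and X_selected and y come from the feature-selection sketch:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a stratified test set so the rare fraud class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, random_state=42, stratify=y
)

# Decision tree that uses Gini impurity to measure the quality of each split.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
```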
The accuracy of the model is 0.99, which looks extremely high. However, accuracy is not a meaningful metric on an imbalanced dataset. Other indicators such as recall, F1-score, and ROC AUC should be checked as well.
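These extra indicators can be computed as follows (a sketch, reusing tree, X_test, and y_test from above):

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score

y_pred = tree.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall:  ", recall_score(y_test, y_pred))  # share of frauds actually caught
print("F1-score:", f1_score(y_test, y_pred))
# ROC AUC uses the predicted probability of the positive (fraud) class.
print("ROC AUC: ", roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))
```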
Oversampling replicates the positive cases so that their number equals the number of negative cases.
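A minimal sketch with imbalanced-learn's RandomOverSampler (the library choice is an assumption); resampling is applied to the training data only:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Randomly replicate fraud cases until both classes are the same size.
ros = RandomOverSampler(random_state=42)
X_train_over, y_train_over = ros.fit_resample(X_train, y_train)
print(Counter(y_train_over))  # both classes now have equal counts
```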
To make the number of positive cases equal to that of negative cases, undersampling randomly deletes negative cases.
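The undersampled counterpart, again sketched with imbalanced-learn:

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop legitimate transactions until the classes are balanced.
rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
```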
Based on the feature distribution of the positive cases, SMOTE synthesizes similar new positive instances.
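A sketch with imbalanced-learn's SMOTE; k_neighbors=5 is the library default, not a value stated in this analysis:

```python
from imblearn.over_sampling import SMOTE

# Create synthetic fraud cases by interpolating between each fraud example
# and its nearest fraud neighbours.
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
```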
After oversampling or undersampling, the model achieves higher accuracy and AUC than the one trained on the imbalanced data.