Machine learning models allow us to tackle classification problems. Take this dataset as an example: machine learning helps us determine whether a transaction is legitimate or fraudulent. Since most transactions are not fraudulent, handling imbalanced data is the main challenge of this analysis. Therefore, our main goal is to build a model that can correctly identify the type of a transaction, even though the dataset is imbalanced.
This is one of the classic imbalanced datasets on Kaggle. It contains two days of credit card transactions. The feature names are not disclosed for confidentiality reasons.
1.Data Preprocessing
1.1 Null Values
1.2 Feature Scaling
1.3 Feature Selection
2.Model Selection & Performance
2.1 Before oversampling or undersampling
2.2 Oversampling
2.3 Undersampling
2.4 SMOTE
My findings:
1.There are 30 predictor variables and 1 target variable with 284807 rows.
2.There are no null values in this data set.
3.Columns 'Time' and 'Amount' are not scaled.
4.The data set is highly imbalanced: there are only 492 frauds out of 284807 transactions, so frauds account for about 0.17% of the data.
5.The distribution of Amount is heavily right-skewed, with a mean of about 88, and a few transactions have much larger amounts. (A sketch of these checks follows the list.)
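The findings above can be reproduced with a few quick pandas checks. This is a minimal sketch, assuming the Kaggle file is saved as creditcard.csv (the path and variable names are my own):

```python
import pandas as pd

# Load the Kaggle credit card fraud data set (file path is an assumption).
df = pd.read_csv("creditcard.csv")

print(df.shape)                    # (284807, 31): 30 predictors + the 'Class' target
print(df.isnull().sum().sum())     # 0 -> no null values
print(df["Class"].value_counts())  # 492 frauds (Class == 1) vs. the rest
print(df["Amount"].describe())     # right-skewed, mean around 88
```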
Time and Amount are not scaled, so I apply standardization to both columns.
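A minimal sketch of this step with scikit-learn's StandardScaler, reusing the df frame from the sketch above:

```python
from sklearn.preprocessing import StandardScaler

# Standardize only the unscaled columns; the anonymized features V1-V28 are left as-is.
scaler = StandardScaler()
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])
```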
There are 30 predictor variables in this data set. To reduce the computational cost of modeling, feature selection helps us extract the more informative variables. Since the input variables are numerical and the output variable is categorical, the ANOVA F-test is used to select the top 10 variables. After applying this feature selection method, V17, V14, V12, V10, V16, V3, V7, and V11 appear to be the informative variables.
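One way to run this selection is scikit-learn's SelectKBest with the ANOVA F-score (f_classif); the sketch below assumes the df frame defined earlier:

```python
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns="Class")
y = df["Class"]

# Score each numerical feature against the binary target with the ANOVA F-test
# and keep the ten highest-scoring columns.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(list(X.columns[selector.get_support()]))  # names of the selected features
```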
In this analysis, I use the informative variables to build a decision tree model, with Gini impurity as the criterion for measuring the quality of each split.
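A sketch of the model with scikit-learn's DecisionTreeClassifier; the train/test split ratio and random_state are assumptions, and X_selected and y come from the feature-selection sketch:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a stratified test set so the rare fraud class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, random_state=42, stratify=y
)

# Decision tree that uses Gini impurity to measure the quality of each split.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
```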
The accuracy of the model is 0.99, which looks extremely high. However, accuracy is not a meaningful metric on an imbalanced dataset. Other indicators such as recall, F1-score, and ROC AUC should be checked as well.
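These extra indicators can be computed as follows (a sketch, reusing tree, X_test, and y_test from above):

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score

y_pred = tree.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall:  ", recall_score(y_test, y_pred))  # share of frauds actually caught
print("F1-score:", f1_score(y_test, y_pred))
# ROC AUC uses the predicted probability of the positive (fraud) class.
print("ROC AUC: ", roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))
```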
Oversampling replicates the positive cases so that their number equals the number of negative cases.
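A minimal sketch with imbalanced-learn's RandomOverSampler (the library choice is an assumption); resampling is applied to the training data only:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Randomly replicate fraud cases until both classes are the same size.
ros = RandomOverSampler(random_state=42)
X_train_over, y_train_over = ros.fit_resample(X_train, y_train)
print(Counter(y_train_over))  # both classes now have equal counts
```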
To make the number of positive cases equal to that of negative cases, undersampling randomly deletes negative cases.
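The undersampled counterpart, again sketched with imbalanced-learn:

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop legitimate transactions until the classes are balanced.
rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
```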
Based on the feature distribution of the positive cases, SMOTE synthesizes similar new positive instances.
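A sketch with imbalanced-learn's SMOTE; k_neighbors=5 is the library default, not a value stated in this analysis:

```python
from imblearn.over_sampling import SMOTE

# Create synthetic fraud cases by interpolating between each fraud example
# and its nearest fraud neighbours.
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
```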
After oversampling or undersampling, the model achieves higher accuracy and AUC than the one trained on the imbalanced data.