graydonhope/Credit-Fraud-Detection

Comparing 10 models to detect credit card transaction fraud


Credit Card Transaction Fraud Detection

Trained and compared 10 models to detect whether a credit card transaction is fraudulent or not.

The dataset contains 284,807 examples, each with 31 features, for a total of 8,829,017 values parsed (284,807 × 31).

Note: The dataset is too large to upload to GitHub, so it is left out of the repo. It can be found at: https://www.kaggle.com/mlg-ulb/creditcardfraud

Some initial data visualization

Transaction amount and time distribution

Box plot to visualize outliers in each class based on amount.

In this kernel, I train and analyze a variety of models with different pre-processing techniques. The first technique for handling the unbalanced dataset is random undersampling.

Class distributions before and after random undersampling.

Before

After
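As a rough illustration of random undersampling, the sketch below keeps every row of the smallest class and randomly drops rows from the larger class until both are the same size. The function name `random_undersample` and the toy arrays are assumptions made for this example; the notebook's own code may differ.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Return a class-balanced subset by randomly dropping majority-class rows."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    # Shuffle so the classes are interleaved rather than stacked.
    keep = rng.permutation(np.concatenate(keep))
    return X[keep], y[keep]

# Toy stand-in for the real data: 1,000 legitimate rows and 20 fraud rows.
X_demo = np.random.default_rng(1).normal(size=(1020, 3))
y_demo = np.array([0] * 1000 + [1] * 20)
X_bal, y_bal = random_undersample(X_demo, y_demo)
print(np.bincount(y_bal))  # [20 20]
```

The trade-off is that most of the majority class is thrown away, which is why the balanced set is so much smaller than the original.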

Analyzing the features using a correlation matrix to see which ones are likely to be important.

From this correlation matrix, we can see that features V2, V4, V11, and V19 are positively correlated with the class label, and that features V10, V12, V14, and V16 are negatively correlated with it.
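To make the idea concrete, here is a minimal sketch of ranking features by their Pearson correlation with a 0/1 label. The feature names `pos`, `neg`, and `noise` and the synthetic data are hypothetical; they stand in for the V-columns of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
label = rng.integers(0, 2, size=n)  # hypothetical 0/1 class column
features = {
    "pos": label + rng.normal(scale=0.5, size=n),    # rises with the label
    "neg": -label + rng.normal(scale=0.5, size=n),   # falls with the label
    "noise": rng.normal(size=n),                     # unrelated to the label
}

# Pearson correlation of each feature with the label.
corr = {name: np.corrcoef(col, label)[0, 1] for name, col in features.items()}
for name, r in sorted(corr.items(), key=lambda kv: kv[1]):
    print(f"{name:>5}: {r:+.2f}")
```

Features at either end of the sorted list are the ones worth inspecting further; those near zero carry little linear signal about the class.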

Check out the boxplots of these features.

We can see that these features have a great number of outliers, which can hurt our models' accuracy. Some of these outliers will be removed. I calculate the interquartile range (a measure of statistical dispersion) by subtracting the 25th percentile from the 75th percentile (quartile75 - quartile25) and multiply it by an outlier cutoff factor of 1.5. Any point lower than (lower quartile - cutoff) is removed. Similarly, any point greater than (upper quartile + cutoff) is also removed.
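A minimal sketch of the standard interquartile-range rule follows, assuming the usual fences at 1.5 × IQR beyond each quartile. The helper name `iqr_mask` and the sample values are made up for illustration.

```python
import numpy as np

def iqr_mask(values, factor=1.5):
    """Boolean mask that is True for points inside the IQR fences."""
    q25, q75 = np.percentile(values, [25, 75])
    cutoff = factor * (q75 - q25)  # 1.5 x interquartile range
    lower, upper = q25 - cutoff, q75 + cutoff
    return (values >= lower) & (values <= upper)

v = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 30.0, -25.0])
mask = iqr_mask(v)
print(v[mask])           # the two extreme values are dropped
print((~mask).sum())     # number of outliers removed: 2
```

In a setting like this one, the mask would typically be computed on one feature at a time and applied to the rows of the balanced dataset.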

Feature V2 contained the highest number of outliers at 46.

It is a good idea to check whether the two classes form separate clusters, since that indicates whether future predictive models are likely to be accurate.

Here are 3 algorithms fit to the data to reveal its cluster structure:

We see that t-distributed stochastic neighbor embedding (t-SNE) separates the classes best.
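For readers who want to reproduce this kind of plot, the sketch below projects synthetic data to two dimensions with scikit-learn's `TSNE`. The two Gaussian blobs and the `perplexity` value are assumptions chosen for the example, not settings taken from the notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic 10-dimensional blobs standing in for the two classes.
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(40, 10)),
                    rng.normal(6.0, 1.0, size=(40, 10))])
y_demo = np.array([0] * 40 + [1] * 40)

# perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X_demo)
print(embedding.shape)  # (80, 2)
```

Scatter-plotting `embedding[:, 0]` against `embedding[:, 1]`, colored by `y_demo`, shows whether the classes fall into separate groups.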

These are the learning curves of the models after optimizing their hyperparameters.

Note how badly the Random Forest and K Nearest Neighbors classifiers overfit the data.
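Overfitting shows up in a learning curve as a training score that stays well above the validation score. The following sketch computes both with scikit-learn's `learning_curve` on synthetic data; the estimator, dataset, and training sizes are placeholders rather than the notebook's tuned models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=5, train_sizes=np.linspace(0.2, 1.0, 5))

# Average over the cross-validation folds for each training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_mean, val_mean):
    print(f"{n:4d} samples: train={tr:.3f}  validation={va:.3f}  gap={tr - va:+.3f}")
```

A gap that does not shrink as the training size grows is the pattern to look for in the Random Forest and K Nearest Neighbors curves.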

Displaying the ROC curves (with AUC scores) after cross-validation

Here we see that Logistic Regression is performing best on the test data.
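One common way to obtain a cross-validated ROC curve is to collect out-of-fold decision scores with `cross_val_predict` and pass them to `roc_curve` and `roc_auc_score`. The sketch below does this on a synthetic imbalanced dataset; the data and the choice of five folds are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)

# Each sample is scored by a model that never saw it during training.
scores = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=5, method="decision_function")

fpr, tpr, _ = roc_curve(y_demo, scores)
auc = roc_auc_score(y_demo, scores)
print(f"ROC AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` for each model on the same axes gives the kind of comparison described here.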

I will now implement the second technique: SMOTE oversampling.
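SMOTE creates synthetic minority-class rows by interpolating between a minority point and one of its nearest minority neighbors. The NumPy sketch below shows that core idea only; `smote_sketch` is a hypothetical helper written for this example, not the library routine. In practice a package such as imbalanced-learn (`imblearn.over_sampling.SMOTE`) is the usual choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per row
    base = rng.integers(0, len(X_min), size=n_new)  # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                    # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_fraud = np.random.default_rng(1).normal(size=(20, 4))  # toy minority class
X_new = smote_sketch(X_fraud, n_new=80)
print(X_new.shape)  # (80, 4)
```

Because each synthetic row lies on a segment between two real minority rows, oversampling should be applied to the training folds only, so that no synthetic copies of validation data leak into training.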

After training the best logistic regression model from the previous section on the oversampled data, I obtained these results.

I then used TensorFlow as a backend to implement two neural networks, each with one hidden layer. The neural networks are used to see which dataset provides better accuracy (SMOTE oversampled vs. random undersampled).

Here is the accuracy on the last few epochs:

Random Undersampling

SMOTE Oversampling

We see that the SMOTE oversampled neural network has greater accuracy but also takes longer to train.

The results of the neural networks are shown in confusion matrix form. From the results, we see that SMOTE oversampling performs much better than random undersampling: the oversampled model misclassified 40 cases, whereas the random undersampling model misclassified 2,823 cases.
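A misclassification count like the ones quoted above is the sum of the off-diagonal cells of the confusion matrix (false positives plus false negatives). The sketch below works through that arithmetic on a made-up set of eight predictions; the labels are illustrative, not results from the notebook.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """2x2 confusion matrix: rows are true classes, columns are predictions."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])
cm = confusion_counts(y_true, y_pred)

# Off-diagonal cells are the errors: total minus the correct (diagonal) cells.
misclassified = cm.sum() - np.trace(cm)
print(cm)                               # [[4 1]
                                        #  [1 2]]
print("misclassified:", misclassified)  # misclassified: 2
```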

As a final technique, I implemented a Voting Classifier along with Bagging and Pasting Ensemble Classifiers.

Here are the results, respectively:
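For reference, the sketch below shows how the three ensemble types can be set up in scikit-learn. In `BaggingClassifier`, `bootstrap=True` samples with replacement (bagging) and `bootstrap=False` samples without replacement (pasting). The base estimators, sample fraction, and synthetic data are assumptions for this example; the notebook's actual configuration is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)

models = {
    # Hard voting: each model casts one vote and the majority class wins.
    "voting": VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("knn", KNeighborsClassifier()),
                    ("tree", DecisionTreeClassifier(random_state=0))],
        voting="hard"),
    # Bagging: each tree sees a sample drawn with replacement.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50,
        max_samples=0.5, bootstrap=True, random_state=0),
    # Pasting: each tree sees a sample drawn without replacement.
    "pasting": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50,
        max_samples=0.5, bootstrap=False, random_state=0),
}

results = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in results.items():
    print(f"{name:>8}: accuracy={acc:.3f}")
```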

Please note that these models were originally created by me; I then used Janio Martinez's work as a reference for many of the visualization techniques and for the overall process.