Credit Card Fraud Detection using RapidMiner! 🕵️‍♂️💳

Introduction to the Data: 📊

The data contains credit card transactions over a period of time, and the aim is to identify the fraudulent transactions so that customers are not charged for items they did not purchase.

The data has 227,846 transaction examples, including 394 fraudulent ones. There are 28 numerical attributes (v1 to v28); the 'time' attribute is the time elapsed between the current transaction and the first one; 'amount' is the transaction amount; and 'class' is the label attribute, which takes the value '1' for a fraudulent transaction and '0' otherwise.

The task is a classification task: given the historical data, predict which future transactions are fraudulent. A closer look shows that the data is highly imbalanced (fraud makes up only 0.173% of all transactions).
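Just as an illustration of what this imbalance looks like outside RapidMiner, here is a minimal pandas sketch for loading the CSV and checking the class distribution; the file name creditcard.csv and the column name Class are assumptions, not taken from the actual process:

```python
import pandas as pd

# File and column names are assumptions; adjust to the actual CSV
df = pd.read_csv("creditcard.csv")

# Class distribution: 0 = correct transaction, 1 = fraud
counts = df["Class"].value_counts()
print(counts)
print("Fraud ratio: {:.3%}".format(counts[1] / len(df)))  # roughly 0.173% for this data
```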

Task 1:

Basic Workflow 📝

In this initial stage of workflow:

• We import the data from the .csv file into RapidMiner and store it in a local repository.
• Change the type of the class attribute from integer to binomial, then map the values: 0 to ‘correct’ and 1 to ‘fraud’.
• Split the data into 70% train and 30% test using the stratified sampling method, which ensures that the class distribution in the train and test sets is the same as in the whole data set. The test set is kept aside until the end.

In the screenshot attached below, the process is provided for your reference.
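RapidMiner handles this split with its own operators; purely as a rough equivalent, the same stratified 70/30 split can be sketched with scikit-learn (column names and the random seed are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")                          # as in the loading sketch above
df["Class"] = df["Class"].map({0: "correct", 1: "fraud"})   # substitute 0 -> correct, 1 -> fraud

X = df.drop(columns=["Class"])
y = df["Class"]

# 70% train / 30% test; stratify keeps the class ratio identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```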

image

Modeling (file: Final Exam basic model import) 🧪💻📊

Note: in this basic modelling, it took about one hour to train the data using KNN (k=5) for a single fold; therefore, my machine could not run cross-validation for KNN at this stage.

Now let us take an overview of the behaviour of the data and apply three classification algorithms without any refinement. The training and test data are obtained by executing the file Exam-1-import. The models are KNN (k=5), a decision tree, and Naïve Bayes (cross-validation folds = 10). The first observation is that KNN is much slower than the others.
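As a rough sketch of the same experiment outside RapidMiner (not the exact operators used in the process), the three models could be compared with 10-fold cross-validation in scikit-learn, scoring on the recall of the fraud class; X_train and y_train come from the split sketch above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Recall of the 'fraud' class is the interesting metric here, not raw accuracy
fraud_recall = make_scorer(recall_score, pos_label="fraud")

models = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),   # by far the slowest, as noted above
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring=fraud_recall)
    print(f"{name}: mean fraud recall = {scores.mean():.3f}")
```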

The accuracy is above 99% for KNN and the decision tree and about 97% for Naïve Bayes (on both validation and test data), which looks good for all of them. Looking more closely, however, the recall of the positive class (fraud) is only 5.08% for KNN, rising to 67.67% for the decision tree on validation and 76.27% on test, and 82.25% for Naïve Bayes on validation and 84.75% on test.

image

In the screenshots attached above and below, the process is provided for your reference.

image

image

image

image

Looking at these results, we can say that the decision tree performs much better than the other algorithms, with good precision and sufficient recall.

Task 2

Data Preparation ✨

In this stage, the data is prepared for modelling and refinement:

  • The unprocessed training and test data are obtained by executing the file Exam-1-import (Task 1).
  • Analysis of the data shows that there are no missing values.
  • Duplicated examples in the training set are detected and removed.
  • The training set is normalized (z-transformed so that each variable has mean = 0 and standard deviation = 1), and a dimensionality reduction method (PCA) is applied to keep 95% of the total variance. 27 attributes remain in the PCA output, which gives the impression that the original numerical data is almost uncorrelated and might itself be the output of an earlier PCA operation. (A rough scikit-learn equivalent of this step is sketched after this list.)
  • The normalization and PCA models are grouped so that they can be applied to the test set later, keeping the dimensions exactly the same.
  • Finally, the preprocessed training and test data are stored in the local repository so that preprocessing does not have to be repeated each time a model is run.
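A rough scikit-learn equivalent of the normalization and PCA step, assuming X_train and X_test from the split sketch in Task 1; the Pipeline plays the same role as grouping the models in RapidMiner, so the transformations fitted on the training data are reused unchanged on the test data:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

prep = Pipeline([
    ("zscore", StandardScaler()),        # mean = 0, standard deviation = 1
    ("pca", PCA(n_components=0.95)),     # keep 95% of the total variance
])

X_train_prep = prep.fit_transform(X_train)   # fit only on the training data
X_test_prep = prep.transform(X_test)         # apply the same grouped models to the test set
print(X_train_prep.shape)                    # 27 components remained in the report
```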

image

Hyperparameter Optimization 🎯

The preprocessed training and test sets stored in the previous section still suffer from the class imbalance mentioned earlier. To address this, the classes (correct, fraud) are balanced using the Sample operator.
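The report only states that the classes are balanced with the Sample operator; assuming this means downsampling the majority class to the size of the minority class, a rough equivalent looks like this (variables come from the earlier sketches):

```python
import pandas as pd

# Rebuild a labelled training frame from the preprocessed features
train = pd.DataFrame(X_train_prep)
train["Class"] = y_train.to_numpy()

fraud = train[train["Class"] == "fraud"]
correct = train[train["Class"] == "correct"]

# Downsample the majority class so both classes have the same number of examples
correct_down = correct.sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, correct_down]).sample(frac=1, random_state=42)

X_bal = balanced.drop(columns=["Class"])
y_bal = balanced["Class"]
print(y_bal.value_counts())
```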

image image

  • Decision Tree: the number of cross-validation folds and the maximum depth of the tree are optimized using the Optimize Parameters operator. The best result is obtained with 20-fold cross-validation and a tree with max depth = 4 (an illustrative scikit-learn version of this grid search is sketched below the screenshot).

image
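The Optimize Parameters operator is essentially a grid search; a rough scikit-learn analogue for tuning the tree depth is sketched here (the grid is illustrative, and unlike RapidMiner the number of folds stays fixed within a single search):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 8, 10]},   # illustrative grid
    scoring=fraud_recall,     # fraud-recall scorer from the earlier modelling sketch
    cv=20,                    # 20 folds gave the best result in the report
)
search.fit(X_bal, y_bal)      # balanced, preprocessed training data
print(search.best_params_)    # the report found max depth = 4
```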

Looking at the performance of the model on the test data, we obtain an accuracy of 96.15% with an 88.98% recall of the fraud class (TNR), which is much better than in the basic modelling. The precision of the fraud class is much lower (3.86%), but as mentioned above, detecting the actual fraud matters more (to some extent).

image

image

  • Naïve Bayes: after optimizing the number of cross-validation folds, the accuracy on the test data is 94.01%, with an 88.98% recall of the fraud class.

image

  • KNN: the number of folds is optimized between 2 and 12, and k of the KNN model is optimized between 2 and 10; the best result is k = 10 with 12-fold cross-validation.

image

image

Ensemble (Voting) 🤝

In this section, the voting ensemble technique is used to combine the three models and see whether the ensemble outperforms each of them.

In voting, to classify an example, each classifier has one vote, and the example is assigned to the class that receives the most votes.
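As an illustrative sketch only (the actual process uses RapidMiner's voting ensemble, not scikit-learn), the same idea with the three tuned models:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Hard voting: each base model casts one vote per example and the majority class wins
vote = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=10)),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
vote.fit(X_bal, y_bal)                 # balanced, preprocessed training data from above
y_pred = vote.predict(X_test_prep)     # predictions on the held-out, preprocessed test set
```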

Using this technique while optimizing the hyperparameters (k in KNN, max depth in the decision tree, and the number of cross-validation folds for each model), an accuracy of 96.32% is obtained. The recall of the fraud class is 88.98%, which is lower than the fraud recall of the optimized KNN, meaning that voting does not give a better result.

image

image

image

image

Final Model: 🏆

Before discussing the final result, another ensemble technique, Bagging, is applied.

Bagging relies on the bootstrapping technique, in which the training set for each classifier is drawn from a common training set by sampling with replacement. The predictions of the individual classifiers are then aggregated.

The bagging technique is applied to the three models. The hyperparameters below are fixed to the best values found so far, and parameter optimization is applied to the sample ratio of the Bagging operator (a rough scikit-learn sketch of this setup follows the list of fixed parameters below).

image

image

The fixed parameters are:

  • KNN: k = 10, cross-valid-fold = 12
  • Decision Tree: max depth = 4.
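A rough scikit-learn sketch of this setup for the kNN case, with the fixed value above and only the sample ratio searched; the number of bagged models and the grid itself are assumptions (and the BaggingClassifier parameter is called base_estimator in older scikit-learn versions):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Bagging around the tuned kNN (k = 10), training each model on a bootstrap sample
bagged_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=10),
    n_estimators=10,          # number of bootstrap models is an assumption
    bootstrap=True,           # sampling with replacement
    random_state=42,
)

# Only the sample ratio (max_samples) is searched, mirroring the report
search = GridSearchCV(
    bagged_knn,
    param_grid={"max_samples": [0.1, 0.3, 0.5, 0.7, 0.9]},
    scoring=fraud_recall,     # fraud-recall scorer from the earlier sketch
    cv=12,
)
search.fit(X_bal, y_bal)
print(search.best_params_)
```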

By applying this final design to the test data, the results are:

  • Naïve Bayes: accuracy = 93.94%, recall of fraud = 88.98%, AUC = 0.956, and the ROC curve moves closer to the ideal.

image

  • Decision Tree: accuracy = 94.94%, recall of fraud = 90.68%, AUC = 0.976, and the ROC curve gets closer to the top-left corner, which means the cost of the imbalance is reduced by using Bagging.

image

image

  • KNN: accuracy = 98.17%, recall of fraud = 92.37%. The type-1 error (FPR) = 1 - 92.37% = 7.63% and the type-2 error (FNR) = 1 - 98.18% = 1.82%, which means both types of error become much smaller and the cost of the imbalance is reasonably reduced. AUC = 0.983, and the ROC curve is almost perfect.

image

image

The KNN model with this final configuration has the best performance and gives the best prediction of fraud, even though it is not as accurate in predicting the correct transactions: many correct transactions may be flagged as fraud.

image

The predictions (using KNN) show that there are 118 fraudulent transactions in the original test data, while 1,352 transactions are predicted as fraud. This results from the imbalance in cost between the two classes. Still, we accept that detecting actual fraud is more important than occasionally flagging a correct transaction as fraud.
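A quick back-of-envelope check of what these counts imply for precision; the precision itself is not stated in the report, so this is only an estimate:

```python
# Counts taken from the report above
actual_fraud = 118        # frauds present in the held-out test set
predicted_fraud = 1352    # transactions flagged as fraud by the final kNN model
recall = 0.9237

true_positives = round(recall * actual_fraud)   # about 109 frauds caught
precision = true_positives / predicted_fraud    # about 8% of flagged transactions are real fraud
print(true_positives, f"precision ~ {precision:.1%}")
```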