Detecting credit card fraud using classic ML algorithms. Nothing fancy here.
The dataset comes from the IEEE-CIS Fraud Detection competition on Kaggle.
You can read about the contest here:
https://www.kaggle.com/c/ieee-fraud-detection/data
This repository was created to practice applying ML algorithms to imbalanced datasets and to try out feature selection techniques.
The inherently imbalanced dataset was balanced using random undersampling. Oversampling with SMOTE was also considered but discarded because the machine running the notebook was old and the dataset is large (over half a million rows and 430+ columns).
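Random undersampling can be sketched as follows. This is a minimal NumPy illustration on toy data, not the notebook's actual code; the function name `random_undersample` and the toy arrays are assumptions made for the example.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the minority class
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # Sample without replacement so no row is duplicated
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Toy data (hypothetical): 95 legitimate rows (0) and 5 fraud rows (1)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(X_bal.shape, np.bincount(y_bal))
```

The trade-off is that most majority-class rows are thrown away, which keeps memory use low but discards information that an oversampling method would retain.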
Feature selection was done using the Boruta algorithm (BorutaPy library). Correlations among the features were checked before and after feature selection. Below are the correlations among the selected features:
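One way to check correlations among features is to list the pairs whose absolute Pearson correlation exceeds a threshold. The sketch below uses NumPy on synthetic data; the helper `correlated_pairs`, the 0.9 threshold, and the feature names are hypothetical and are not taken from the notebook.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |Pearson r| >= threshold."""
    # rowvar=False treats each column of X as one feature
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic features (hypothetical): feat_b is almost a scaled copy of feat_a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.01, size=500)
c = rng.normal(size=500)  # independent of a and b
X_demo = np.column_stack([a, b, c])

pairs = correlated_pairs(X_demo, ["feat_a", "feat_b", "feat_c"])
print(pairs)
```

Running the same check before and after feature selection shows whether the selected subset still contains redundant, highly correlated features.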
A logistic regression classifier was used to predict whether a transaction was fraudulent. Without any feature engineering, the classifier reached a recall of 56% on the test dataset. Note that in this kind of ML problem, recall matters more than accuracy, because on an imbalanced dataset a model can score high accuracy while missing most fraud cases.
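To see why accuracy can mislead on imbalanced data, consider a model that labels every transaction as legitimate. The sketch below is plain Python with made-up label counts (35 fraud cases out of 1000, chosen only for illustration); the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fn, fp, tn), treating label 1 (fraud) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn

def recall(tp, fn):
    """Share of actual fraud cases that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, fn, fp, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fn + fp + tn)

# Illustrative labels only: 35 fraud cases among 1000 transactions
y_true = [1] * 35 + [0] * 965
y_pred = [0] * 1000  # a model that never predicts fraud

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fn, fp, tn)  # 0.965
rec = recall(tp, fn)            # 0.0
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

The model is right 96.5% of the time yet catches no fraud at all, which is why recall is the more informative number here.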
Below is the confusion matrix obtained from the classifier's predictions: