This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions spanning January 1, 2019 to December 31, 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants.
Dataset link: https://drive.google.com/drive/folders/1sDzIPjCmNZ9lWaXfcAqIIx4NZchYG4OP
- Build a model to detect fraudulent credit card transactions.
- Experiment with various machine learning algorithms like Logistic Regression, Decision Trees, Random Forests, XGBoost, etc., to classify transactions as fraudulent or legitimate.
- Experiment with different sampling techniques to handle imbalanced data.
- Loading the Data: The dataset is loaded into a pandas DataFrame.
- Exploratory Data Analysis: Basic visualizations and statistics are computed to understand the data.
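A minimal loading-and-inspection sketch; the CSV file names and the is_fraud label column are assumptions based on the usual layout of this simulated dataset:

```python
import pandas as pd

# Assumed file names for the train/test splits shipped with the dataset.
train_df = pd.read_csv("fraudTrain.csv")
test_df = pd.read_csv("fraudTest.csv")

# Basic EDA: shape, dtypes, and the class balance of the (assumed) is_fraud label.
print(train_df.shape)
print(train_df.dtypes)
print(train_df["is_fraud"].value_counts(normalize=True))
```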
Data Preparation:
- Cleaning the data
- Handling missing values
- Handling categorical features
- Converting categorical features into numerical features using techniques like one-hot encoding, target encoding, and label encoding (see the sketch after this list)
- Feature engineering
- Feature scaling / normalization / standardization
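A sketch of the encoding and scaling steps, continuing from the loading sketch above; the column names (category, gender, merchant, city, amt, is_fraud) are assumed based on the dataset's typical layout:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = train_df.copy()

# One-hot encode the low-cardinality 'category' column.
df = pd.get_dummies(df, columns=["category"], prefix="category")

# Label encode the binary 'gender' column.
df["gender"] = LabelEncoder().fit_transform(df["gender"])

# Target (mean) encode high-cardinality columns such as merchant and city.
for col in ["merchant", "city"]:
    means = df.groupby(col)["is_fraud"].mean()
    df[col + "_encoded"] = df[col].map(means)
    df = df.drop(columns=col)

# Standardize the transaction amount.
df["amt"] = StandardScaler().fit_transform(df[["amt"]])
```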
Model Selection: Various machine learning models are defined (sketched after this list), including:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Trees
- Random Forest
- XGBoost
- Support Vector Machine (SVM)
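One way to set these up is a dictionary of candidate models so they can be trained and evaluated uniformly; the hyperparameters shown are illustrative defaults, not the tuned values:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}
```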
Cross-Validation: The models are trained using repeated K-fold / stratified K-fold cross-validation.
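A sketch of both CV schemes, assuming X and y are the prepared feature matrix and is_fraud labels and reusing the models dictionary from above:

```python
from sklearn.model_selection import RepeatedKFold, StratifiedKFold, cross_val_score

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Example: ROC AUC of one candidate model under stratified CV.
scores = cross_val_score(models["Logistic Regression"], X, y, cv=skf, scoring="roc_auc")
print(scores.mean(), scores.std())
```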
Handling Imbalance using Oversampling Techniques: Oversampling techniques such as RandomOverSampler, SMOTE, and ADASYN are used to handle class imbalance.
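A sketch of the three oversamplers from imbalanced-learn, assuming y is a pandas Series of is_fraud labels:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

samplers = {
    "Random Oversampler": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}

# Each sampler duplicates or synthesizes minority (fraud) rows until the
# classes are roughly balanced.
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, y_res.value_counts().to_dict())
```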
Model Training: The models are trained using repeated K-fold / stratified K-fold cross-validation, both with and without oversampling.
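One way to combine oversampling with cross-validation without leaking synthetic samples into the validation folds is an imbalanced-learn Pipeline; a sketch reusing the models, skf, and SMOTE objects defined above:

```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_validate

results = {}
for name, model in models.items():
    # Baseline: no oversampling.
    plain = cross_validate(model, X, y, cv=skf, scoring=["precision", "recall"])
    # Oversampled: SMOTE is applied only to the training folds inside the pipeline.
    pipe = ImbPipeline([("smote", SMOTE(random_state=42)), ("clf", model)])
    resampled = cross_validate(pipe, X, y, cv=skf, scoring=["precision", "recall"])
    results[name] = (plain["test_recall"].mean(), resampled["test_recall"].mean())
```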
Prediction: The trained models are used to classify transactions as fraudulent or legitimate.
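Since tuned probability thresholds are reported in the results below, prediction can be sketched as probability scoring followed by thresholding; best_model, X_train/X_test, and y_train are assumed names:

```python
best_model = models["XGBoost"]
best_model.fit(X_train, y_train)

# Predict fraud probabilities, then apply a decision threshold; 0.5 is a
# placeholder, the tuned values appear in the results section.
proba = best_model.predict_proba(X_test)[:, 1]
threshold = 0.5
y_pred = (proba >= threshold).astype(int)
```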
Model Evaluation: The models are evaluated using metrics such as Confusion Matrix, ROC Curve, AUC, Precision-Recall Curve, and Classification Report.
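A sketch of the evaluation step with scikit-learn metrics, continuing from the prediction sketch:

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, roc_auc_score, roc_curve)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
print("ROC AUC:", roc_auc_score(y_test, proba))

# Curve data for plotting and for choosing the decision threshold.
fpr, tpr, roc_thresholds = roc_curve(y_test, proba)
precision, recall, pr_thresholds = precision_recall_curve(y_test, proba)
```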
Hyperparameter Tuning: GridSearchCV / RandomizedSearchCV are used for hyperparameter tuning.
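An illustrative RandomizedSearchCV over an XGBoost search space; the parameter grid here is an example, not necessarily the one actually used:

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring="recall",
    cv=skf,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```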
Some other typical models for imbalanced data (sketched below):
- Isolation Forest
- Local Outlier Factor
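Both are unsupervised outlier detectors; a sketch where the contamination rate is set to the observed fraud rate:

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

fraud_rate = float(y_train.mean())

iso = IsolationForest(contamination=fraud_rate, random_state=42)
lof = LocalOutlierFactor(n_neighbors=20, contamination=fraud_rate)

# Both detectors label outliers as -1 and inliers as 1; map -1 to "fraud".
iso_pred = (iso.fit_predict(X_train) == -1).astype(int)
lof_pred = (lof.fit_predict(X_train) == -1).astype(int)
```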
Model Saving: The best-performing models are saved using joblib.
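Saving and reloading with joblib, assuming the tuned estimator from the search sketch above (the file name is illustrative):

```python
import joblib

joblib.dump(search.best_estimator_, "xgboost_fraud_model.joblib")

# Later, reload the model for inference.
loaded_model = joblib.load("xgboost_fraud_model.joblib")
```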
Feature Importance: The features category, amt (the transaction amount), transaction_hour, age, gender, merchant_encoded, and city_encoded have a significant impact on the prediction of fraudulent vs. legitimate transactions.
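A ranking like this can be read off the fitted model; a sketch using XGBoost's feature_importances_, where best_model and X_train are the assumed names from the earlier sketches and X_train is a DataFrame:

```python
import pandas as pd

importances = pd.Series(best_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```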
For best recall:
XGBoost model with Random Oversampling and RepeatedKFold CV
- Accuracy: 0.996809527
- Precision: 0.64
- Recall: 0.96
- ROC AUC: 0.999117511
- Threshold: 0.195022404193878
For the best overall balance (good precision and decent recall):
XGBoost model with RepeatedKFold CV, without any oversampling
- Accuracy: 0.999125454
- Precision: 0.96
- Recall: 0.87
- ROC AUC: 0.99951423
- Threshold: 0.006252252