/Financial-Fraud-Detection

Primary LanguageJupyter NotebookMIT LicenseMIT

Financial Fraud Detection

Introduction

What is fraud detection?

Fraud detection is a set of methods used to prevent money or assets from being gained through deception.

Fraud detection Techniques

  • Machine Learning
  • Data Mining
  • Neural networks
  • Pattern Recognition

Fraud detection using machine learning

Machine learning algorithms detect patterns in financial transactions and determine whether they are valid.They have the ability to detect fraudulent transactions that human auditors might miss, and they can do so in real time.

In this work we have used publicly available simulated payment transaction data and implemented various supervised machine learning algorithms to the problem of fraud detection. We aim to highlight how supervised machine learning techniques may be utilised to accurately classify data with substantial class imbalance.

Data Description

  • The dataset has over 6 million transactions and 11 variables
  • There is a variable named ‘isFraud’ that indicates actual fraud status of the transaction, this is the class variable for our analysis

The columns in the dataset are described as follows:

Methodology

methodology

Data analysis:

1. Data Cleaning

  • Type conversion
  • Dropping columns
  • Removing unwanted rows (PAYMENT, CASH-IN and DEBIT) and managed to reduce the data from over 6 million transactions to ~2.8 million transactions
  • Creating dummy variables

2. Exploratory Data Analysis

  • Maximum transaction amount was via Transfer while debit being the least

  • Fraudulant transactions oocured only by transactions via Cash Out and Transfer

  • People use Cash Out method the most for transaction while Debit being the least

Model Building

  • The data is normalized using Sklearn Standard Scalar
  • The data is splitted into three respective sets:
    • Train set(98%)
    • Dev set(1%)
    • Test set(1%)
  • The train data was then fed into the model
    • We trained the dataset on three models
      • Random Forest Regressor
      • Naive Bayes
      • Logistic Regression

Data Visualization

  • Gaussian Graph

  • Scatter Plot

Observations

  • The transaction is fraudulent if the transfer amount is substantially less or more than the new balance of the destination

  • According to the dataset given, if the destination is merchant then the account balance will be hidden. Therefore, we've put this into the consideration that the models performance doesn't get affected, which is done by keeping destination account column

  • By this we can say that the transaction amount is maximum when step is between 300 and 350 and decreases drastically as the step increases

Conclusion

According to the bar graph shown, we can conclude that logistic regression seems to produce excellent results. The performance of the Logistic regression is consistent between the training and testing datasets, so there is no overfitting