Course Project for CS6502 - Applied Big Data and Visualization
Title - Big Data Approach for Credit Card Fraud Detection
Technologies Utilised:
- PySpark
- Google Colab Notebooks
- Microsoft Power BI
This Big Data project focuses on credit card fraud detection. It was developed as part of the course "Applied Big Data and Visualization" (CS6502) at the University of Limerick, taught by Dr. Andrew Ju.
Credit card fraud is a pervasive problem that has grown more frequent in recent years. This project applies big data analytics and machine learning to transaction data in order to find patterns and anomalies that indicate fraudulent activity, with the aim of detecting them in real time and making financial transactions more secure.
The output of the project is a prediction of whether a transaction is fraudulent or not, based on the historical data used to train three machine learning models: Perceptron, Logistic Regression, and Random Forest. We used PySpark in notebooks on Google Colab. By drawing on its in-memory processing, distributed computing, fault tolerance, and advanced analytics, we aimed to improve the accuracy and efficiency of fraud detection. The project's goal is to protect financial institutions and consumers against malicious activity.
The notebook covers the stages of the project: data loading, exploratory data analysis (EDA), data preprocessing, model training, and evaluation. It demonstrates techniques for handling imbalanced datasets, data cleaning, feature engineering, and model evaluation.
- Overview
- Dataset Description
- Data Loading (ETL Pipeline)
- Data Summarizing (Plots, Data Summary, Categorical and Numerical Features, Imbalance in Dataset)
- Data Transformation (Adding 3 new Features)
- Data Preparation (Undersampling, Indexing, One-Hot Encoding, Scaling)
- Model Training (Perceptron, Logistic Regression, Random Forest)
- Model Evaluation Metrics, including why the Perceptron is preferred
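The undersampling step listed under Data Preparation can be illustrated with a minimal sketch. This is plain Python rather than the project's PySpark code, and the `is_fraud` label name, the `undersample` helper, and the toy rows are all assumptions made for illustration, not taken from the notebook.

```python
import random
from collections import Counter


def undersample(rows, label_key="is_fraud", seed=42):
    """Randomly drop majority-class rows until both classes are the same size.

    `label_key` is a hypothetical column name; labels are assumed to be 0 or 1.
    """
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    # The shorter list is the minority class, the longer one the majority.
    minority, majority = sorted((fraud, legit), key=len)
    # Keep every minority row and an equally sized random sample of the majority.
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 5 fraudulent rows among 100 transactions.
rows = [{"amt": float(i), "is_fraud": 1 if i % 20 == 0 else 0} for i in range(100)]
balanced = undersample(rows)
print(Counter(r["is_fraud"] for r in balanced))
```

Undersampling discards most of the majority class, so it trades data volume for a balanced training set. That trade is usually acceptable when the original dataset is large, as is typical for transaction data.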
Our Power BI dashboard offers a comprehensive analysis of credit card transactions aimed at detecting fraudulent activities. Through a series of visualizations, we delve into various aspects of transaction data to uncover patterns and anomalies indicative of potential fraud. The dashboard provides insights across multiple dimensions, including:
- Categorical breakdowns of fraudulent transactions
- Trends over time through time series analysis
- Spending patterns across different transaction categories
- Geographical analysis of spending behavior
- Financial insights into spending amounts
- Correlations between spending and city populations
- Relational analysis between spending amounts and fraud occurrences
- Gender analysis of spending habits and the frequency of fraudulent transactions by gender
- Incorporating large-scale data analytics and machine learning tools increased the efficiency of fraud detection and monitoring in our project.
- Undersampling the majority class worked well for training models on the imbalanced dataset.
- The Perceptron classifier gives a marginally higher recall than the other models, so using it helps reduce false negatives (fraudulent transactions that go undetected).
- Visualizations help in understanding the dataset and in making data-driven decisions.
- Continuous vigilance and adaptation are essential in combating evolving fraud tactics.
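The link between recall and false negatives noted above can be made concrete with a short sketch. Recall is TP / (TP + FN), so every fraudulent transaction a model misses lowers it. The labels and the two prediction lists below are hypothetical values chosen for illustration. They are not results from the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Fraction of actual fraud cases that the model flagged."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels and predictions from two imaginary models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [1, 1, 1, 0, 0, 0, 1, 0]  # misses one fraud case
model_b = [1, 1, 0, 0, 0, 0, 0, 0]  # misses two fraud cases

print(recall(y_true, model_a), recall(y_true, model_b))
```

In this toy case the first model raises one false alarm but misses fewer fraud cases, which is the trade-off a recall-focused choice of classifier accepts.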