This project involves analyzing a credit card transaction dataset to detect fraudulent transactions. The analysis includes data preprocessing, exploratory data analysis (EDA), and the application of various machine learning models to predict fraudulent transactions.
The following libraries were used in this project:
- pandas: For data manipulation and analysis.
- warnings: To suppress warnings during the analysis.
- matplotlib: For data visualization.
- seaborn: For advanced data visualization.
- scikit-learn: For machine learning algorithms and data preprocessing.
The dataset was loaded using pd.read_csv('creditcard.csv')
.
df.info()
: Provided a concise summary of the DataFrame, including the number of entries, column names, data types, and memory usage.df.describe()
: Offered a statistical summary of the numerical columns, including count, mean, standard deviation, min, 25th percentile, median, 75th percentile, and max values.df.isnull().sum()
: Detected and counted missing values in each column.df.isnull().sum().max()
: Found the maximum number of missing values across all columns.
Computed the correlation between each numerical column and the Class
column, which indicates whether a transaction is fraudulent.
- Heatmap: Created a heatmap of the Pearson correlation coefficients for the features.
- Count Plot: Visualized the distribution of the
Class
column to show the imbalance between genuine and fraudulent transactions. - Histograms: Displayed side-by-side histograms for the distribution of transaction amounts and transaction times.
- Density Plot: Visualized the distribution of transaction times for genuine and fraudulent transactions.
- Box Plots: Created side-by-side box plots for the distribution of transaction amounts by class.
- KDE Plots: Generated Kernel Density Estimate plots for each feature, comparing their distributions between genuine and fraudulent transactions.
Standardized the Amount
and Time
features to have a mean of 0 and a standard deviation of 1 using StandardScaler
from scikit-learn.
Split the data into training and testing sets using train_test_split
from scikit-learn. The target variable was Class
, and the features were all other columns.
Four different classifiers were set up using scikit-learn:
- Logistic Regression:
LogisticRegression()
- Decision Tree Classifier:
DecisionTreeClassifier()
- K-Nearest Neighbors Classifier:
KNeighborsClassifier()
- Gaussian Mixture Model:
GaussianMixture()
- Each classifier was trained using the training data (
X_train
andy_train
). - Cross-validation scores were computed using 5-fold cross-validation to evaluate the performance of each model.
The results of the model evaluations, including accuracy, precision, recall, and F1-score, were compared to determine the best-performing model for detecting fraudulent transactions.
-
Logistic Regression:
- Accuracy: 0.9991
- Confusion Matrix: [[56854 10] [39 59]]
- F1 Score: 0.7065868263473054
-
Decision Tree Classifier:
- Accuracy: 0.9992
- Confusion Matrix: [[56832 32] [15 83]]
- F1 Score: 0.7487684729064039
-
K-Nearest Neighbors Classifier:
- Accuracy: 0.9995
- Confusion Matrix: [[56859 5] [22 76]]
- F1 Score: 0.8491620111731844
Based on the evaluation metrics, the K-Nearest Neighbors Classifier performed the best overall, with the highest accuracy, confusion matrix and F1-score. Given the imbalanced nature of the dataset, precision and recall were particularly important metrics. Therefore, the K-Nearest Neighbors Classifier was selected as the final model for detecting fraudulent transactions.