This notebook contains an exploratory data analysis and a predictive machine learning model for fraud detection. Fraud detection is valuable to many industries, including banking and finance, insurance, law enforcement, and government agencies.
In recent years fraud attempts have increased sharply, which makes fraud detection both important and challenging. Despite countless efforts and human supervision, hundreds of millions are lost to fraud. Fraud can be committed in various ways, e.g. stolen credit cards, misleading accounting, or phishing emails. Because fraud cases are rare within a large population of transactions, they are hard to detect.
Data mining and machine learning help to anticipate and quickly identify fraud so that action can be taken to limit costs. With data mining tools, a huge number of transactions can be searched to spot patterns and identify fraudulent transactions.
The data does not contain any NULL values.
step False
type False
amount False
nameOrig False
oldbalanceOrg False
newbalanceOrg False
nameDest False
oldbalanceDest False
newbalanceDest False
isFraud False
isFlaggedFraud False
dtype: bool
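The per-column flags above have the shape of what pandas returns from `DataFrame.isnull().any()`. A minimal sketch of such a check follows; the `toy` frame is made up for illustration, and the notebook's exact call is not shown, so treat this as an assumption about how the output was produced.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook works on the full transaction table.
toy = pd.DataFrame({
    "step": [1, 1, 1],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.00, 181.00],
    "isFraud": [0, 1, 1],
})

# True for any column that holds at least one missing value.
null_flags = toy.isnull().any()
print(null_flags)

# A single boolean answering "is anything missing at all?"
has_missing = bool(null_flags.any())
print(has_missing)
```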
| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrg | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
2 | 1 | TRANSFER | 181.00 | C1305486145 | 181.0 | 0.00 | C553264065 | 0.0 | 0.0 | 1 | 0 |
3 | 1 | CASH_OUT | 181.00 | C840083671 | 181.0 | 0.00 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |
The provided data contains the financial transaction records as well as the target variable isFraud, which is the actual fraud status of the transaction. isFlaggedFraud is the indicator that the simulation uses to flag a transaction based on some threshold value.
Minimum value of Amount, Old/New Balance of Origin/Destination:
amount 0.0
oldbalanceOrg 0.0
newbalanceOrg 0.0
oldbalanceDest 0.0
newbalanceDest 0.0
dtype: float64
Maximum value of Amount, Old/New Balance of Origin/Destination:
amount 9.244552e+07
oldbalanceOrg 5.958504e+07
newbalanceOrg 4.958504e+07
oldbalanceDest 3.560159e+08
newbalanceDest 3.561793e+08
dtype: float64
Since there are no missing or garbage values, there is no need for data cleaning, but we still need to perform data analysis because the values vary hugely in range across columns. Normalization can also improve the overall accuracy of the machine learning model.
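To illustrate the scaling step mentioned here, the following is a minimal sketch of min-max normalization with pandas. The toy values, the helper name `min_max_scale`, and the choice of min-max scaling itself are assumptions; the notebook does not state which normalization it applies.

```python
import pandas as pd

# Hypothetical values spanning several orders of magnitude, like the real columns.
toy = pd.DataFrame({
    "amount": [0.0, 181.0, 9839.64, 92445520.0],
    "oldbalanceDest": [0.0, 21182.0, 0.0, 356015900.0],
})


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the [0, 1] range."""
    span = frame.max() - frame.min()
    # Guard against constant columns, which would otherwise divide by zero.
    span = span.replace(0, 1)
    return (frame - frame.min()) / span


scaled = min_max_scale(toy)
print(scaled)
```

After scaling, every column shares the same range, so a column with very large balances no longer dominates one with small amounts.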
The graph above shows that TRANSFER and CASH_OUT are the two most used modes of transaction, and that they are also the only types in which fraud happens. Thus we will focus on these types of transactions.
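The per-type comparison and the follow-up filter could be sketched as below. The sample rows are hypothetical; only the column names `type` and `isFraud` and the two type labels come from the notebook.

```python
import pandas as pd

# Hypothetical sample; column names follow the table shown earlier.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "CASH_OUT", "DEBIT"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Count fraud and non-fraud transactions per type.
counts = pd.crosstab(toy["type"], toy["isFraud"])
print(counts)

# Keep only the types in which fraud occurs.
fraud_types = ["TRANSFER", "CASH_OUT"]
subset = toy[toy["type"].isin(fraud_types)]
print(len(subset))
```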
**Things we can conclude from this heatmap:**
- OldbalanceOrg and NewbalanceOrg are highly correlated.
- OldbalanceDest and NewbalanceDest are highly correlated.
- Amount is correlated with isFraud (the target variable).
Apart from these, there is not much correlation between the features, so we need to understand whether the relationship between them depends on the type of transaction and the amount. To do so, we need to look at the heatmaps of fraud and non-fraud transactions separately.
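One way to compare the two classes is to compute a correlation matrix for each. This is a sketch under assumptions: the rows are hypothetical, Pearson correlation (the pandas default) is assumed, and the plotting step is left out. Either matrix can then be handed to a heatmap plotting function.

```python
import pandas as pd

# Hypothetical sample using column names from the tables in this notebook.
toy = pd.DataFrame({
    "amount":         [181.0, 2806.0, 20128.0, 9839.64, 1864.28, 11668.14],
    "oldbalanceOrg":  [181.0, 2806.0, 20128.0, 170136.0, 21249.0, 41554.0],
    "oldbalanceDest": [0.0, 21182.0, 26202.0, 0.0, 5000.0, 1200.0],
    "isFraud":        [1, 1, 1, 0, 0, 0],
})

features = ["amount", "oldbalanceOrg", "oldbalanceDest"]

# Correlation matrix computed separately for each class.
fraud_corr = toy.loc[toy["isFraud"] == 1, features].corr()
nonfraud_corr = toy.loc[toy["isFraud"] == 0, features].corr()

print(fraud_corr)
print(nonfraud_corr)
```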
Two flags stand out and are interesting to look into: the isFraud and isFlaggedFraud columns. By hypothesis, isFraud indicates the actual fraud transactions, whereas isFlaggedFraud marks transactions that the system blocks because some threshold was triggered. The heatmap above shows some relation between other columns and isFlaggedFraud, which suggests that those columns are related to isFraud as well.
The total number of fraud transactions is 8213.
The total number of fraud transactions that are marked as fraud is 16.
The ratio of fraud to non-fraud transactions is 1:773.
Thus there is 1 fraud transaction for every 773 non-fraud transactions.
The amount lost due to these fraud transactions is $12056415427.
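Figures of this kind can be derived with a few boolean masks. The sketch below runs on a hypothetical six-row sample, so its numbers are unrelated to the totals reported above; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample; the real figures above come from the full dataset.
toy = pd.DataFrame({
    "amount": [181.0, 181.0, 9839.64, 1864.28, 11668.14, 7817.71],
    "isFraud": [1, 1, 0, 0, 0, 0],
    "isFlaggedFraud": [0, 0, 0, 0, 0, 0],
})

is_fraud = toy["isFraud"] == 1

n_fraud = int(is_fraud.sum())
n_flagged = int((is_fraud & (toy["isFlaggedFraud"] == 1)).sum())
n_nonfraud = int((~is_fraud).sum())

# Non-fraud transactions per fraud transaction (the "1:N" ratio).
ratio = n_nonfraud / n_fraud

# Total amount moved by fraud transactions.
amount_lost = float(toy.loc[is_fraud, "amount"].sum())

print(n_fraud, n_flagged, ratio, amount_lost)
```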
The plot above clearly shows the need for a fast and reliable system to mark fraudulent transactions, since the current system lets fraud transactions pass through without labeling them as fraud. Some data exploration can help to check the relations between features.
| | step | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 | 0 |
1 | 1 | 1 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 | 0 |
2 | 1 | 2 | 181.00 | 181.0 | 0.00 | 0.0 | 0.0 | 1 |
3 | 1 | 3 | 181.00 | 181.0 | 0.00 | 21182.0 | 0.0 | 1 |
4 | 1 | 1 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 | 0 |
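Comparing this table with the raw one suggests that the identifier columns and isFlaggedFraud were dropped and that `type` was encoded as integers. The sketch below reproduces that shape. The codes 1 to 3 are read off the two tables; codes for any other transaction type, and the definitions of `X` and `y`, are assumptions.

```python
import pandas as pd

# Rows shaped like the raw table at the top of the notebook (subset of columns).
raw = pd.DataFrame({
    "step": [1, 1, 1],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.00, 181.00],
    "nameOrig": ["C1231006815", "C1305486145", "C840083671"],
    "nameDest": ["M1979787155", "C553264065", "C38997010"],
    "isFraud": [0, 1, 1],
    "isFlaggedFraud": [0, 0, 0],
})

# Codes 1-3 match the table above; codes for other types are unknown here.
type_codes = {"PAYMENT": 1, "TRANSFER": 2, "CASH_OUT": 3}

prepared = raw.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
prepared["type"] = prepared["type"].map(type_codes)

# Split into features and target (assumed definitions of X and y).
X = prepared.drop(columns=["isFraud"])
y = prepared[["isFraud"]]
print(X)
```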
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for evaluation.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=121)
# Fit a small random forest on the training split.
clf = RandomForestClassifier(n_estimators=15)
clf.fit(train_X, train_y.values.ravel())
# Note: predict() returns hard 0/1 labels, not probabilities.
predictions = clf.predict(test_X)
print(average_precision_score(test_y, predictions))
0.7687057112224541
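The score above was computed from the output of `predict`, which returns hard 0/1 labels. Average precision is a ranking metric, so it is usually given the predicted probability of the positive class instead. The sketch below contrasts the two on synthetic data from `make_classification`; the dataset, its class balance, and the seeds are assumptions for illustration, and the values it prints say nothing about the notebook's result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction data (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=121, stratify=y
)

clf = RandomForestClassifier(n_estimators=15, random_state=0)
clf.fit(train_X, train_y)

# Hard labels collapse the ranking to two levels.
ap_labels = average_precision_score(test_y, clf.predict(test_X))

# Column 1 of predict_proba is the estimated probability of the positive class.
ap_scores = average_precision_score(test_y, clf.predict_proba(test_X)[:, 1])

print(ap_labels, ap_scores)
```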
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly.
import joblib
# Persist the fitted classifier to disk.
joblib.dump(clf, 'RandomForestClassifier.pkl')
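A saved model can be restored with `joblib.load`. The sketch below does a round trip in a temporary directory; the tiny training set is hypothetical and exists only so there is a fitted model to save. Pickle-based files should only be loaded from trusted sources.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Tiny hypothetical training set, only so there is a fitted model to save.
X = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]
clf = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "RandomForestClassifier.pkl")
    joblib.dump(clf, path)
    # Reload the pickled model while the temporary file still exists.
    restored = joblib.load(path)

# The restored model should predict exactly like the original.
same_predictions = list(restored.predict(X)) == list(clf.predict(X))
print(same_predictions)
```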
example
| | index | step | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud |
---|---|---|---|---|---|---|---|---|---|
0 | 2 | 1 | 2 | 181.00 | 181.0 | 0.00 | 0.0 | 0.0 | 1 |
1 | 3 | 1 | 3 | 181.00 | 181.0 | 0.00 | 21182.0 | 0.0 | 1 |
2 | 251 | 1 | 2 | 2806.00 | 2806.0 | 0.00 | 0.0 | 0.0 | 1 |
3 | 252 | 1 | 3 | 2806.00 | 2806.0 | 0.00 | 26202.0 | 0.0 | 1 |
4 | 680 | 1 | 2 | 20128.00 | 20128.0 | 0.00 | 0.0 | 0.0 | 1 |
5 | 0 | 1 | 1 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 | 0 |
6 | 1 | 1 | 1 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 | 0 |
7 | 4 | 1 | 1 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 | 0 |
8 | 5 | 1 | 1 | 7817.71 | 53860.0 | 46042.29 | 0.0 | 0.0 | 0 |
9 | 6 | 1 | 1 | 7107.77 | 183195.0 | 176087.23 | 0.0 | 0.0 | 0 |
display(form)
- The existing rule-based system is not capable of detecting all fraud transactions.
- Machine learning can be used to detect fraud transactions.
- The predictive model achieves a good average precision score (about 0.77 on the test split) and is capable of detecting fraud transactions.