- Dataset : https://www.kaggle.com/mlg-ulb/creditcardfraud
- Run in Google Colab : Credit Card Fraud
- Metric Used : Area Under the ROC curve
- Technique : Anomaly Detection

This is a classic example for practicing anomaly detection. I followed the steps from Andrew Ng's machine learning tutorial on anomaly detection (https://youtu.be/086OcT-5DYI).
- As mentioned in the tutorial, the features used follow a Gaussian distribution.
- To estimate the probability density at a point, GaussianMixture is used.
- The training set contains only non-anomalous samples.
- The anomalous samples are split equally between the test and validation sets.
- The threshold that gives the highest roc_auc_score is chosen using the validation set.
- AIC and BIC scores are used to estimate the number of clusters. A Bayesian mixture model can also give a good estimate of the number of clusters, but it is very slow to train.
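
The steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the data is synthetic (two Gaussian clusters of normal points plus uniformly scattered anomalies) rather than the Kaggle dataset, and the split sizes, the candidate range of 1 to 5 components, the use of BIC as the tie-breaker, and the quantile grid of thresholds are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): two clusters of normal points,
# plus anomalies scattered uniformly over a wider area.
normal = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1500, 2)),
    rng.normal(loc=[6.0, 6.0], scale=1.0, size=(1500, 2)),
])
normal = rng.permutation(normal)  # shuffle rows before splitting
anomalies = rng.uniform(low=-10.0, high=16.0, size=(60, 2))

# Training set: normal samples only.
# Anomalies are split equally between the validation and test sets.
X_train = normal[:2000]
X_val = np.vstack([normal[2000:2500], anomalies[:30]])
y_val = np.concatenate([np.zeros(500, dtype=int), np.ones(30, dtype=int)])
X_test = np.vstack([normal[2500:], anomalies[30:]])
y_test = np.concatenate([np.zeros(500, dtype=int), np.ones(30, dtype=int)])

# Estimate the number of clusters: fit one model per candidate count
# and compare AIC/BIC on the training data (lower is better).
candidates = list(range(1, 6))
models = [GaussianMixture(n_components=k, random_state=0).fit(X_train)
          for k in candidates]
aics = [m.aic(X_train) for m in models]
bics = [m.bic(X_train) for m in models]
gmm = models[int(np.argmin(bics))]  # assumption: pick by BIC

# score_samples returns the log-density of each point under the mixture;
# a low value means the point is unlikely under the "normal" model.
val_logp = gmm.score_samples(X_val)

# Try a grid of thresholds and keep the one whose binary predictions
# give the highest roc_auc_score on the validation set.
thresholds = np.quantile(val_logp, np.linspace(0.01, 0.5, 50))
best_eps, best_auc = None, -1.0
for eps in thresholds:
    pred = (val_logp < eps).astype(int)  # 1 = flagged as anomaly
    auc = roc_auc_score(y_val, pred)
    if auc > best_auc:
        best_eps, best_auc = eps, auc

# Final check on the held-out test set with the chosen threshold.
test_pred = (gmm.score_samples(X_test) < best_eps).astype(int)
test_auc = roc_auc_score(y_test, test_pred)
print(f"components={gmm.n_components}  val AUC={best_auc:.3f}  test AUC={test_auc:.3f}")
```

Because only normal samples are used for fitting, the mixture models what "normal" looks like, and the anomalies are needed only to pick and evaluate the threshold. On the real dataset the feature columns, split sizes, and candidate range would need to be adapted.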