- analyze the dataset and come up with a model that will best detect fraudulent transactions
- compare different popular models and determine which ones perform better
- explore machine learning validation metrics to determine quality of each model (AUPRC vs AUC-ROC)
- with the highly imbalanced dataset, try different resampling techniques (oversampling/undersampling)
- go back and use some oversampling method to see how the models change with the availability of more data
- dig into each method's parameters and tune them to see if we can get better results. Determine whether or not the dataset was optimized for this kind of problem
- Dataset is a financial transactions dataset from Kaggle
- Reduced Featureset: 28 Features derived from a prior PCA. Original features were scrubbed for user anonymity
- Time & Amount are the only two original features
- Total Samples in DataSet: 284,807. Number of Fraudulent transactions: 492 (0.172% of all transactions), represented by the "Class" Feature
- get XGBoost working
- create baseline model comparison dataframe with confusion matrix results of all the models
- do AUPRC vs AUC-ROC comparison / analysis
- do RandomUnderSampler (but a better version)
- do oversampling method for 5k, 10k, 100k, equal parity
- re-run the same algos, see how they change over time with more data, or as the data changes
- visualize the efficacy of each model with more time
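
As a rough illustration of why the AUPRC vs AUC-ROC comparison matters on data this imbalanced, here is a minimal, standard-library-only sketch. The helper names `roc_auc` and `average_precision` and the toy ranking are assumptions made for illustration, not part of the project code, and the toy data is synthetic rather than drawn from the Kaggle dataset. Average precision is used as the usual estimate of AUPRC, and the helper assumes no tied scores.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the probability that a random positive outranks a
    random negative (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision (a common AUPRC estimate): the mean of the
    precision measured at the rank of each true positive.
    Assumes scores are distinct, so the ranking is unambiguous."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


# Synthetic toy ranking: 1,000 transactions, 2 of them fraud.
# A hypothetical model scores 10 legitimate transactions above both frauds.
labels = [0] * 10 + [1] * 2 + [0] * 988
scores = [1.0 - i / 1000 for i in range(1000)]  # strictly decreasing, no ties

print(f"AUC-ROC: {roc_auc(labels, scores):.3f}")
print(f"AUPRC (average precision): {average_precision(labels, scores):.3f}")
```

In this toy case AUC-ROC stays close to 1 because the frauds outrank almost all of the 998 legitimate transactions, while average precision is low because the top of the ranking is dominated by false positives. That gap is the reason to compare both metrics rather than rely on AUC-ROC alone.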

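The oversampling levels in the list above can be sketched as simple arithmetic on the class counts. This assumes that "5k, 10k, 100k, equal parity" refers to the target number of fraud samples after oversampling, which the notes do not state explicitly. The helper `oversample_indices` is hypothetical and does plain random duplication with replacement; it stands in for whichever oversampling method ends up being used.

```python
import random

TOTAL = 284_807        # total samples in the dataset
FRAUD = 492            # fraudulent transactions (Class == 1)
LEGIT = TOTAL - FRAUD  # legitimate transactions


def oversample_indices(minority_idx, target, seed=0):
    """Return the original minority indices plus random duplicates
    (drawn with replacement) until there are `target` of them."""
    if target < len(minority_idx):
        raise ValueError("target must be at least the current minority count")
    rng = random.Random(seed)
    extra = rng.choices(minority_idx, k=target - len(minority_idx))
    return list(minority_idx) + extra


# Assumed interpretation: each level is a target fraud-sample count.
targets = {"5k": 5_000, "10k": 10_000, "100k": 100_000, "parity": LEGIT}

plan = {}
for name, target in targets.items():
    plan[name] = {
        "added": target - FRAUD,                  # samples to synthesize or duplicate
        "fraud_share": target / (LEGIT + target), # class balance after oversampling
    }

for name, row in plan.items():
    print(f"{name:>6}: add {row['added']:>7,} fraud samples, "
          f"fraud share {row['fraud_share']:.2%}")

# Example: indices 0..491 stand in for the fraud rows of a training split.
resampled = oversample_indices(list(range(FRAUD)), targets["5k"])
```

Oversampling should be applied to the training split only, after the train/test split, so that duplicated or synthetic fraud samples do not leak into the evaluation data.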