Working with Highly Imbalanced Data for a Fraud Classification Problem:
- Performed undersampling because the data is highly imbalanced (about 99% of the records are non-fraud).
- Separated a test sample prior to any EDA, sampling, scaling, etc.
- For features highly correlated with the dependent variable, used the IQR rule and box plots to filter outliers.
- Used t-SNE to visualize how well the classes separate, for better understanding.
- Trained Decision Tree, Logistic Regression, and Random Forest classifiers on the undersampled data.
- Used GridSearchCV for hyperparameter tuning.
- Used stratified cross-validation, so that every fold preserves the class ratio, to reduce the risk of overfitting.
- Evaluated with ROC-AUC rather than accuracy (accuracy is only informative for balanced data): 97.96% using Logistic Regression.
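
The core of the steps above (hold-out split, undersampling, tuning with stratified cross-validation, ROC-AUC evaluation) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic (generated with make_classification as a stand-in for the fraud dataset), the parameter grid and random seeds are arbitrary assumptions, and only Logistic Regression is shown. Outlier filtering and t-SNE are left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fraud data (assumption): 1% positive ("fraud") class.
X, y = make_classification(
    n_samples=20000, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0,
    weights=[0.99, 0.01], flip_y=0, random_state=0,
)

# 1. Hold out a stratified test set before any sampling or scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Random undersampling on the training set only: keep every minority row
#    and draw an equal number of majority rows without replacement.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)
sampled_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
keep = rng.permutation(np.concatenate([minority_idx, sampled_majority]))
X_under, y_under = X_train[keep], y_train[keep]

# 3. Tune the regularization strength with stratified CV, scored by ROC-AUC.
#    Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}  # illustrative grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_under, y_under)

# 4. Evaluate on the untouched, still-imbalanced test set.
test_scores = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print("best params:", search.best_params_)
print("test ROC-AUC:", round(test_auc, 4))
```

Scoring on the original, imbalanced test set rather than on undersampled data keeps the reported ROC-AUC representative of the class ratio the model would see in practice.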