Box plot to visualize outliers in each class based on amount.
In this kernel, I train and analyze a variety of models with different pre-processing techniques. The first technique for the unbalanced dataset is Random Undersampling.
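As a rough illustration of random undersampling, here is a minimal NumPy sketch. The toy data, the `random_undersample` helper, and the class sizes are assumptions made for the example; they are not the kernel's actual code or dataset.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Draw n_min row indices from each class without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid leaving the rows grouped by class
    return X[keep], y[keep]

# Toy imbalanced data: 990 legitimate rows (0) and 10 fraud rows (1).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 990 + [1] * 10)

X_bal, y_bal = random_undersample(X, y)
print("before:", np.bincount(y), "after:", np.bincount(y_bal))
```

The trade-off is that most majority-class rows are discarded, so the balanced set is small.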
Class distributions before and after random undersampling.
Before
After
Analyzing the features using a correlation matrix to see which ones are likely to be important.
From this correlation matrix, we can see that features V2, V4, V11, and V19 are positively correlated with the class label, and that features V10, V12, V14, and V16 are negatively correlated with it.
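A small sketch of how such a reading can be taken from a correlation matrix, assuming the correlations of interest are those between each feature and the class label. The three synthetic features (`pos`, `neg`, `noise`) are hypothetical stand-ins, not the V-features of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)  # binary class label

# Hypothetical features: one rises with the class, one falls, one is unrelated.
features = {
    "pos": y + rng.normal(scale=0.5, size=n),
    "neg": -y + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
}
names = list(features)

# Put the class label in the last column, then correlate all columns.
data = np.column_stack([features[k] for k in names] + [y])
corr = np.corrcoef(data, rowvar=False)

# The last column of the matrix holds each feature's correlation with the class.
class_corr = dict(zip(names, corr[:-1, -1]))
for name, value in class_corr.items():
    print(f"{name}: {value:+.2f}")
```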
Check out the boxplots of these features.
We can see that these features have a large number of outliers, which can hurt our models' accuracy. Some of these outliers will be removed. I calculate the interquartile range (a measure of statistical dispersion) by subtracting the 25th percentile from the 75th percentile (quartile75 - quartile25), then multiply it by 1.5 to get an outlier cutoff. Any point lower than (lower quartile - cutoff) is removed. Similarly, any point greater than (upper quartile + cutoff) is also removed.
Feature V2 contained the highest number of outliers at 46.
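The interquartile-range rule can be sketched as follows. The helper name `remove_outliers_iqr` and the sample values are made up for illustration; how the kernel selects rows and features before applying the rule is not shown here.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Keep only points inside [q25 - k*IQR, q75 + k*IQR] (hypothetical helper)."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    cutoff = k * iqr
    lower, upper = q25 - cutoff, q75 + cutoff
    mask = (values >= lower) & (values <= upper)
    return values[mask], lower, upper

# Made-up feature values with one obvious outlier.
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
kept, lower, upper = remove_outliers_iqr(v)
print("bounds:", lower, upper)
print("removed:", len(v) - len(kept))
```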
It is a good idea to use some dimensionality-reduction algorithms to check whether the classes form separate clusters, which indicates whether future predictive models will be accurate.
Here are 3 such algorithms fit onto the data.
We see that the T-distributed stochastic neighbor embedding (t-SNE) performs the best.
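For reference, a minimal sketch of a t-SNE projection with scikit-learn's `TSNE`. The two synthetic Gaussian blobs are an assumption standing in for the two classes, and the default settings are used rather than whatever the kernel chose.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two classes, 50 rows each, 10 features.
X = np.vstack([
    rng.normal(loc=0.0, size=(50, 10)),
    rng.normal(loc=4.0, size=(50, 10)),
])
y = np.array([0] * 50 + [1] * 50)

# Project to two dimensions; random_state makes the layout repeatable.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)
```

Scatter-plotting the two columns of `embedding`, colored by `y`, shows whether the classes separate visually.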
These are the learning curves of the models after optimizing their hyperparameters.
**Note** how badly the Random Forest and K Nearest Neighbors classifiers overfit the data.
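Overfitting shows up in a learning curve as a persistent gap between training and validation scores. Below is a sketch using scikit-learn's `learning_curve`; the synthetic dataset and the Random Forest settings are assumptions, not the tuned models from the kernel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic binary classification data (an assumption for the example).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Average over the 5 folds; one value per training-set size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={n:3d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```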
Displaying the ROC curves and their AUC scores after cross-validation.
Here we see that Logistic Regression is performing best on the test data.
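One way to obtain a cross-validated ROC curve is to collect out-of-fold decision scores with `cross_val_predict` and pass them to `roc_curve`. This is a sketch on synthetic data with an untuned logistic regression, so it only illustrates the mechanics and not the kernel's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data (an assumption for the example).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Each row's score comes from a model that did not see that row during training.
scores = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="decision_function"
)

fpr, tpr, thresholds = roc_curve(y, scores)
auc = roc_auc_score(y, scores)
print(f"ROC AUC: {auc:.3f}")
```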
I will now implement the second technique - SMOTE Oversampling.
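SMOTE creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbors. The NumPy sketch below is a simplified SMOTE-style routine written for illustration; it is not the imbalanced-learn implementation, and `smote_like` and the toy data are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Pairwise distances between minority rows; a row is never its own neighbor.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base row, then one of its k nearest neighbors.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_major = rng.normal(loc=0.0, size=(200, 2))  # toy majority class
X_minor = rng.normal(loc=3.0, size=(20, 2))   # toy minority class

X_new = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_res = np.vstack([X_major, X_minor, X_new])
y_res = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_new)))
print(np.bincount(y_res))
```

Because every synthetic row lies between two real minority rows, the oversampled set keeps all original data, at the cost of a much larger training set.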
After training the best logistic regression model from the previous section on the oversampled data, I obtained these results.
I then used TensorFlow as a backend to implement two neural networks, each with one hidden layer. The neural nets are used to see which dataset provides better accuracy (SMOTE Oversampled vs. Random Undersampled).
Here is the accuracy on the last few epochs:
Random Undersampling
SMOTE Oversampling
We see that the SMOTE Oversampled neural network has a greater accuracy, but it also takes longer to train.
Here are the results of the neural networks in confusion matrix form. From the results, we see that the SMOTE oversampling technique performs much better than random undersampling: the oversampled model misclassified 40 cases, whereas the random undersampling model misclassified 2823 cases.
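The misclassification count is the sum of the off-diagonal cells of the confusion matrix. A small sketch with made-up labels and predictions (not the kernel's outputs):

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes, columns are predicted classes (hypothetical helper)."""
    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

# Made-up labels and predictions for ten cases.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix_2x2(y_true, y_pred)
misclassified = cm.sum() - np.trace(cm)  # everything off the diagonal
print(cm)
print("misclassified:", misclassified)
```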
As a final technique, I implemented a Voting Classifier along with Bagging and Pasting Ensemble Classifiers.
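A sketch of the three ensembles with scikit-learn follows. Bagging samples training rows with replacement (`bootstrap=True`), while pasting samples them without replacement (`bootstrap=False`). The base estimators, the soft-voting choice, and the synthetic data are all assumptions; the kernel's actual configuration is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for the example).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the members.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",
)
# Bagging: each tree sees a bootstrap sample (drawn with replacement).
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=25, max_samples=0.5, bootstrap=True, random_state=0,
)
# Pasting: each tree sees a subsample drawn without replacement.
pasting = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=25, max_samples=0.5, bootstrap=False, random_state=0,
)

models = [("voting", voting), ("bagging", bagging), ("pasting", pasting)]
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models
}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```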
Here are the results, respectively:
Please note that I originally created these models myself, but then used Janio Martinez's work as a reference for many of the visualization techniques and for the overall process.