A comprehensive case study of the malware dataset from Microsoft
The notebook is too long to render (download the zip and run it locally). Here are the learnings from the case study:
The goal is to prevent malware attacks on a computer system by identifying whether a given file or piece of software is malware. Identifying malware files is crucial for the security of the system.
Data Source: https://www.kaggle.com/c/malware-classification/data
Predict the class of malware (one of the 9 label classes) for a given file
- Minimize multi-class error
- Multi-class probability estimates
- Fast processing and labelling of malware (on the order of minutes)
- Multi-class Log loss
- Confusion matrix
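As a rough illustration of the multi-class log loss metric, the sketch below computes it by hand and compares it with scikit-learn's `log_loss`. The helper name `multiclass_log_loss` and the toy probabilities are assumptions made for this example, not values from the case study. For reference, a model that always predicts a uniform distribution over 9 classes would score ln(9) ≈ 2.197.

```python
import numpy as np
from sklearn.metrics import log_loss


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices in [0, n_classes)
    proba:  array of shape (n_samples, n_classes), rows summing to 1
    """
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, y_true])))


# Toy example with 3 classes (made-up numbers, not from the dataset).
y_true = np.array([0, 2, 1, 2])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
])

manual = multiclass_log_loss(y_true, proba)
reference = log_loss(y_true, proba, labels=[0, 1, 2])
print(manual, reference)
```

Lower is better: a confident wrong prediction is penalized heavily, which is why the metric pairs naturally with the requirement for probability estimates.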
The dataset is randomly split into training, cross-validation and test sets with 64%, 16% and 20% of the data respectively.
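One way to get a 64/16/20 split is to call `train_test_split` twice: hold out 20% for testing, then take 20% of the remaining 80% (16% overall) for cross validation. This is a minimal sketch on synthetic data; the use of `stratify` and the `random_state` values are assumptions, since the write-up only says the split is random.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 5 features, labels 1..9.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(1, 10, size=1000)

# Step 1: hold out 20% of the data for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Step 2: take 20% of the remaining 80% (16% overall) for cross validation.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=42)

print(len(X_train), len(X_cv), len(X_test))
```

Stratifying keeps the class proportions similar across the three sets, which matters when some classes have few data points.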
Here are plots and insights from some of the most informative EDA results.
Here are a few observations from the above plots:
- Labels 1, 2 and 3 are the most frequent malware classes.
- Labels 8 and 9 follow.
- Labels 4, 5 and 7 are the least frequent, with fewer data points for these labels.
From the above plot, the size of the byte file might be useful in classifying the type of malware.
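A file-size feature of this kind could be extracted roughly as follows. This is a sketch under assumptions: the helper names `file_size_mb` and `byte_file_sizes` are hypothetical, and it assumes the byte files sit in one directory with a `.bytes` extension.

```python
import os


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024.0 * 1024.0)


def byte_file_sizes(directory):
    """Map each file ID (file name without '.bytes') to its size in MB."""
    sizes = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".bytes"):
            file_id = name[: -len(".bytes")]
            sizes[file_id] = file_size_mb(os.path.join(directory, name))
    return sizes
```

The resulting dictionary can be joined with the class labels on the file ID to plot size against class.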
Using t-SNE for dimensionality reduction, we check whether the labels can be separated in a 2-D scatter plot.
With a perplexity of 50 (roughly, the effective number of nearest neighbors considered for each point), the plot does not show distinct boundaries between the different labels.
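The t-SNE projection described above can be sketched with scikit-learn as follows. The data here is random noise standing in for the real features, and `random_state=0` is an assumption; only the perplexity of 50 and the 2-D output come from the write-up.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the feature matrix: 200 samples, 20 features, labels 1..9.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(1, 10, size=200)

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=50, random_state=0)
X_2d = tsne.fit_transform(X)

# X_2d[:, 0] and X_2d[:, 1] can then be drawn as a scatter plot colored by y.
print(X_2d.shape)
```

Because t-SNE is stochastic and sensitive to perplexity, it is usually worth trying a few perplexity values before concluding that classes do not separate.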
The performance of the different machine learning models is measured with the precision matrix. The matrix plotted for each trained model is shown below.
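Assuming "precision matrix" here means the confusion matrix with each column divided by its column sum (so that the diagonal holds the per-class precision), it could be computed along these lines. The labels below are a made-up toy example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for 3 classes (not from the dataset).
y_true = np.array([1, 1, 2, 2, 2, 3, 3, 1])
y_pred = np.array([1, 2, 2, 2, 3, 3, 3, 1])

# Rows are true classes, columns are predicted classes.
C = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Divide each column by the number of predictions made for that class;
# columns with no predictions are left as zeros.
col_sums = C.sum(axis=0, keepdims=True)
precision_matrix = np.divide(
    C, col_sums, out=np.zeros(C.shape, dtype=float), where=col_sums != 0)

print(precision_matrix)
```

Each column then answers the question: of all files predicted as this class, what fraction truly belongs to each class?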
Trained a KNN classifier with Calibrated CV; the optimal k was found to be 3. The precision matrix for predictions on the test set is shown below:
Log loss for classification on test set: 0.089
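The KNN tuning step might look roughly like the sketch below, assuming "Calibrated CV" refers to scikit-learn's `CalibratedClassifierCV`. The synthetic data, the candidate values of k, and the choice of `method="sigmoid"` with `cv=3` are all assumptions for illustration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the malware features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# For each candidate k, calibrate the KNN probabilities and score on the CV set.
cv_loss = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    calibrated = CalibratedClassifierCV(knn, method="sigmoid", cv=3)
    calibrated.fit(X_train, y_train)
    cv_loss[k] = log_loss(y_cv, calibrated.predict_proba(X_cv),
                          labels=calibrated.classes_)

# Pick the k with the lowest cross-validation log loss.
best_k = min(cv_loss, key=cv_loss.get)
print(best_k, cv_loss[best_k])
```

Calibration matters here because log loss scores the predicted probabilities, and raw KNN vote fractions tend to be coarse.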
Trained a logistic regression classifier with an L2 penalty for regularization and a sigmoid activation function, with Calibrated CV. The precision matrix for predictions on the test set is shown below.
Log loss for classification on test set: 0.415
Trained a Random Forest model with Calibrated CV, tuning the number of estimators. The best result was obtained with 1000 estimators. The precision matrix for predictions on the test set is shown below.
Log loss for classification on test set: 0.0503
Trained an XGB classifier and ran RandomizedSearchCV over parameters including learning rate, number of estimators, max depth, column sampling and subsample. The precision matrix for predictions on the test set is shown below.
Log loss for classification on test set: 0.032
The best performance is observed for the XGB model with the following hyper-parameters:
- n_estimators = 1000
- max_depth = 3
- learning_rate = 0.03
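A randomized hyper-parameter search of this kind can be sketched as below. To keep the example dependent on scikit-learn only, it swaps in `GradientBoostingClassifier` as a stand-in for the XGB classifier; the data, the search ranges, `n_iter` and `cv` are assumptions and not the settings used in the case study. With xgboost installed, `XGBClassifier` could be substituted, using `colsample_bytree` in place of `max_features`.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic 3-class problem so the search runs quickly.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Distributions to sample from (illustrative ranges only).
param_distributions = {
    "n_estimators": randint(20, 60),        # integers 20..59
    "max_depth": randint(2, 5),             # integers 2..4
    "learning_rate": uniform(0.01, 0.2),    # floats in [0.01, 0.21]
    "subsample": uniform(0.5, 0.5),         # floats in [0.5, 1.0]
    "max_features": uniform(0.5, 0.5),      # fraction of columns per split
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    scoring="neg_log_loss",   # maximizing this minimizes log loss
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_log_loss = -search.best_score_
print(best_params, best_log_loss)
```

Randomized search samples a fixed number of configurations instead of trying every combination, which keeps the cost manageable when many parameters are tuned at once.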