/Malware-classification-paper-implementation-using-Convolutional-Neural-Network

Implementation of IEEE paper: Malware Classification with Deep Convolutional Neural Networks


Malware-classification-paper-implementation


Files:

MMCD implementation = Microsoft Malware Classification Dataset implementation

Malimg classification implementation = Malimg dataset implementation

Summary:

Problem: Previous research on malware classification suggests that malware typically falls into families that share common behavior. Building a method that can efficiently classify a sample by its family, regardless of which variant it is, therefore seems a good approach, and one that can cope with the rapid growth of malware.

Solution: The data comes in raw form (binary values in the Malimg dataset and hexadecimal values in the Microsoft dataset). These values are converted into a 2D matrix of decimal values, which is treated as an image. In the paper, the researchers classify malware with a CNN: the hidden layers consist of multiple convolution + ReLU and max-pooling layers, followed by a fully connected ReLU layer at the end. A softmax activation is then applied to predict the class of the image.
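The conversion step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, and it assumes a hex dump in which each line starts with an address followed by two-digit hex byte values, with `??` marking an unknown byte (mapped to 0 here). The image width is also an arbitrary choice.

```python
import math


def hex_dump_to_values(text):
    """Parse lines such as '00401000 56 8D ?? 24' into a list of ints in 0..255."""
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: the first token on each line is an address, not data.
        for token in tokens[1:]:
            # Assumption: '??' marks an unknown byte; it is mapped to 0.
            values.append(0 if token == "??" else int(token, 16))
    return values


def values_to_matrix(values, width):
    """Arrange byte values into rows of `width`, zero-padding the last row."""
    values = list(values)  # also accepts a raw `bytes` object
    height = math.ceil(len(values) / width)
    padded = values + [0] * (height * width - len(values))
    return [padded[row * width:(row + 1) * width] for row in range(height)]


sample = "00401000 56 8D ?? 24\n00401004 08 50 8B F1"
matrix = values_to_matrix(hex_dump_to_values(sample), width=3)
print(matrix)  # [[86, 141, 0], [36, 8, 80], [139, 241, 0]]
```

Each entry is a grayscale pixel intensity in 0..255. A raw binary file can be passed to `values_to_matrix` directly as `bytes`, since iterating over `bytes` already yields integers.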

Approach: I followed the overall approach of the paper. I first built an input layer and then used transfer learning from a model pretrained on ImageNet. On top of that I added a dense ReLU layer, and finally a softmax layer to predict the class of the image.
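A transfer-learning model of this shape could look roughly like the Keras sketch below. It is an illustration under assumptions, not the notebook's code: the function name `build_classifier` is hypothetical, and freezing the pretrained base, flattening its output, the 128x128x3 input shape, the Adam optimizer, and the sparse categorical cross-entropy loss are all choices made for the example.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16


def build_classifier(num_classes, dense_units=1024,
                     input_shape=(128, 128, 3), weights="imagenet"):
    """Pretrained VGG16 base + dense ReLU layer + softmax output."""
    # Convolutional base without VGG16's own classifier head.
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    # Assumption: the pretrained layers are frozen and only the new head is trained.
    base.trainable = False

    inputs = keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = layers.Flatten()(x)
    x = layers.Dense(dense_units, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    # Assumption: labels are integer class indices, hence the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_classifier(num_classes=n)` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random initialization. The ImageNet weights expect three input channels, so a grayscale byte image would have to be repeated across three channels before being fed in.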

Evaluations and Results:

Malimg dataset: I used the whole dataset, which is small: around 10,000 images totalling around 17 MB. I tried several hyperparameter combinations, varying the number of units in the last dense layer, the optimizer, and the base model:

VGG16, 4096 units, RMSprop: train accuracy 81%, test accuracy 79.6%

VGG16, 2048 units, SGD: train accuracy 75%, test accuracy 73.5%

VGG16, 3072 units, Adam: train accuracy 80%, test accuracy 78%

VGG16, 4096 units, SGD: train accuracy 76%, test accuracy 75%

VGG19, 1024 units, SGD: train accuracy 77.01%, test accuracy 76%

VGG19, 1024 units, Adam: train accuracy 82.53%, test accuracy 81.6%

Point to be noted: I only ran 10 epochs per run, so these results could likely be improved by training for more epochs.
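For a quick comparison, the Malimg runs reported above can be ranked by test accuracy with a few lines of Python. The numbers are copied from this README; the tuple layout and variable names are only for illustration.

```python
# (base model, units in last dense layer, optimizer, train accuracy %, test accuracy %)
runs = [
    ("VGG16", 4096, "RMSprop", 81.0, 79.6),
    ("VGG16", 2048, "SGD", 75.0, 73.5),
    ("VGG16", 3072, "Adam", 80.0, 78.0),
    ("VGG16", 4096, "SGD", 76.0, 75.0),
    ("VGG19", 1024, "SGD", 77.01, 76.0),
    ("VGG19", 1024, "Adam", 82.53, 81.6),
]

# Sort from highest to lowest test accuracy.
ranked = sorted(runs, key=lambda run: run[4], reverse=True)

for model, units, optimizer, train_acc, test_acc in ranked:
    gap = train_acc - test_acc  # difference between train and test accuracy
    print(f"{model:5} {units:5d} {optimizer:8} "
          f"train={train_acc:6.2f} test={test_acc:6.2f} gap={gap:5.2f}")

best = ranked[0]
```

In these runs the three SGD configurations have the lowest test accuracy, and the gap between train and test accuracy is about two percentage points or less in every run.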

Microsoft Malware Classification dataset: Because this dataset is very large (around 10,000 samples, around 500 GB), I used only part of it: 500 images of size 128×128. I ran my model only once on this dataset and achieved an accuracy of 52% after 10 epochs, using Adam with VGG16 and 1024 neurons in the second-to-last layer.

Point to be noted: I expect that training for more epochs and on more data would have given better accuracy. This dataset was also far more difficult to preprocess than the Malimg dataset.

Final word on the result: Though I didn't match the results of the paper, I achieved reasonable results considering my limited hardware: the paper's environment was an Intel Core i7 with 64 GB RAM, whereas I only had an i3 with 6 GB RAM.

IEEE paper link: https://ieeexplore.ieee.org/document/8328749/

New Malimg dataset link: https://www.kaggle.com/afagarap/malimg-dataset (not working anymore)

Malimg dataset link: https://www.dropbox.com/s/ep8qjakfwh1rzk4/malimg_dataset.zip?dl=0

Microsoft classification dataset link: https://www.kaggle.com/c/malware-classification