MalwareDetection

A Malware Detection Method of Code Texture Visualization Based on CNN Combining Transfer Learning

This project focuses on malware detection. Cyberspace security is attracting more and more attention, so detecting malware and its variants is of great significance to cyberspace security. However, the increasing sophistication of malicious variants, which use techniques such as encryption, polymorphism, and obfuscation, makes it more difficult to identify malware effectively. In this project, a malware detection method of code texture visualization based on a CNN (Convolutional Neural Network) combined with transfer learning is proposed. We use visualization technology to map malicious code into corresponding images with typical texture features and thereby classify the malware. First, in order to quickly acquire and locate the representative texture of malware, we adopt a CNN to extract the global and deeper features of malicious code images. Second, we pre-train the CNN model and then transfer it to the malware classification model to accelerate the convergence of the model and improve generalization performance. We construct an improved objective function in which a novel multi-label of classification proportion is added to solve the problem that the texture change of the “.text” section and other sections in the malicious code image is not obvious after transfer learning. We collect code samples of nine malware families from the Kaggle platform and compare the experimental results before and after transfer. The results show that the method accelerates the convergence of the loss function and obtains higher accuracy (approximately 92%).

In recent years, with the development of Internet of Things and artificial intelligence technology, more attack methods have appeared in cyberspace, such as automatic malware generation tools, obfuscation, and polymorphism. These new tools and techniques allow attackers to do more damage at a lower cost, and the amount of malicious code grows exponentially every year. Malware, or malicious code, is code that is intentionally written or set up to steal private data, obtain information, extort money, or destroy a system; examples include computer viruses, Trojans, worms, backdoors, and logic bombs. Minor changes to malicious code can defeat the signature-based approach, and malicious code can increasingly evade signature-based detection through encryption, obfuscation, or packing. Hence such an approach, which uses a finite set of signatures against an ever-growing number of variants, is rather limited. Solving the detection problem for malware that is continuously recompiled, changed, packed, and disguised requires ongoing research to keep up with ever-changing malware.

ORIGINAL DATASET

This project uses malicious samples provided by Microsoft Corporation: a total of 10868 malicious code samples from 9 common families on the Windows platform. In addition, we double the size of the malicious code image data set by flipping, cropping, and changing the brightness of the RGB color channels, for training and testing of the classification models. The sample information is shown in the table below.

MALICIOUS CODE DATA SET OF TRAINING SAMPLES

| Malicious code family | Number of training samples | Type |
|---|---|---|
| Ramnit | 1541 | Worm |
| Lollipop | 2478 | Malicious advertisement |
| Kelihos ver3 | 2942 | Trojan |
| Vundo | 475 | Trojan |
| Simda | 42 | Malicious advertisement |
| Tracur | 751 | Backdoor |
| Kelihos ver1 | 398 | Spam botnet |
| Obfuscator.ACY | 1228 | Spam botnet |
| Gatak | 1013 | Spam botnet |

Building the Convolutional Neural Network:

  1. Loading and preprocessing data: An image classification model has a far better chance of performing well when the training set contains a large number of images. The required shape of the input data also depends on the architecture and framework used.
     ● Convert each bytecode file (hexadecimal byte values, without the PE header) into a grayscale image.
     ● Hold each converted grayscale image as a PIL Image instance (img).
     ● Load each image from the filesystem as a 3D NumPy array.
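The conversion from a bytecode file to a grayscale image can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `bytes_text_to_image`, the fixed row width, the assumption that each line starts with an address token followed by hexadecimal byte values, and the mapping of unreadable `??` bytes to 0 are all assumptions made for the example.

```python
import numpy as np

def bytes_text_to_image(text, width=32):
    """Convert the text of a bytecode file into a 2D uint8 grayscale array.

    Hypothetical helper: the line layout and the handling of "??" are
    assumptions, not taken from the project.
    """
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumed layout: the first token is an address, the rest are hex bytes.
        for token in tokens[1:]:
            # "??" is treated as an unreadable byte and mapped to 0 (assumption).
            values.append(0 if token == "??" else int(token, 16))
    # Drop trailing bytes that do not fill a complete row of pixels.
    height = len(values) // width
    pixels = np.array(values[: height * width], dtype=np.uint8)
    return pixels.reshape(height, width)

sample = "00401000 56 8D 44 24\n00401004 08 50 ?? F1\n"
img = bytes_text_to_image(sample, width=4)
print(img.shape)  # (2, 4)
```

Each byte value (0 to 255) becomes one pixel intensity. If Pillow is installed, `PIL.Image.fromarray(img)` wraps the array as a PIL Image that can be saved or resized.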

  2. Defining the model's architecture: This is another crucial step in building a deep learning model. We kept experimenting with the hyperparameter values until we found the best match.
     ● We use 3 convolutional layers, each with a ReLU activation function.
     ● We use 1 dense layer with a softmax activation.
     ● We add dropout between the layers, with rates of 0.25 and 0.5, to improve training.
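A model with this shape might look like the sketch below. Only the three ReLU convolutional layers, the single softmax dense layer, and the dropout rates 0.25 and 0.5 come from the description above; the filter counts, kernel sizes, pooling layers, exact dropout positions, optimizer, and the (32, 32, 3) input shape are assumptions chosen for illustration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv2D, Dense, Dropout, Flatten, Input, MaxPooling2D,
)

def build_cnn(input_shape=(32, 32, 3), num_classes=9):
    """Hypothetical CNN: 3 ReLU conv layers and 1 softmax dense layer."""
    model = Sequential([
        Input(shape=input_shape),
        Conv2D(32, (3, 3), activation="relu", padding="same"),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu", padding="same"),
        MaxPooling2D((2, 2)),
        Conv2D(128, (3, 3), activation="relu", padding="same"),
        MaxPooling2D((2, 2)),
        Dropout(0.25),   # dropout position is an assumption
        Flatten(),
        Dropout(0.5),    # dropout position is an assumption
        Dense(num_classes, activation="softmax"),
    ])
    # Optimizer and loss are assumptions; the text does not name them.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

cnn = build_cnn()
print(cnn.output_shape)  # (None, 9)
```

The softmax layer has one output per malware family, so `num_classes` is 9 for this data set.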

  3. Training the model: To train the model, we need:
     ● Training images and their corresponding true labels
     ● Validation images and their corresponding true labels (used only to validate the model, not during training)
     We also define the number of epochs in this step. We first run the model for 10 epochs and then increase this to 20 epochs.
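The training call can be sketched as below, assuming the Keras `fit` API with a separate validation set. The data here is random and the model is a deliberately tiny stand-in, so the sketch only shows how the training images, validation images, labels, and epoch count fit together; it runs 2 epochs to stay quick, where the text uses 10 and then 20.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical

NUM_CLASSES = 9
EPOCHS = 2  # the text uses 10, then 20; kept small here so the sketch runs fast

# Random stand-in data (assumption): 32 training and 16 validation images.
rng = np.random.default_rng(0)
x_train = rng.random((32, 32, 32, 3)).astype("float32")
y_train = to_categorical(rng.integers(0, NUM_CLASSES, size=32), NUM_CLASSES)
x_val = rng.random((16, 32, 32, 3)).astype("float32")
y_val = to_categorical(rng.integers(0, NUM_CLASSES, size=16), NUM_CLASSES)

# Tiny placeholder model; any compiled Keras classifier can be used instead.
model = Sequential([
    Input(shape=(32, 32, 3)),
    Flatten(),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Validation data is only evaluated after each epoch; it never updates weights.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=EPOCHS, batch_size=8, verbose=0)
print(sorted(history.history))
```

The returned `history.history` dictionary holds one value per epoch for the training and validation loss and accuracy, which is what accuracy and loss curves are plotted from.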

  4. Estimating the model's performance: Finally, we load the test images and apply the same preprocessing steps. We then predict the classes for these images using the trained model.
     Fig 4.1 - Training with 20 epochs.
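Turning model outputs into predicted classes can be illustrated with a small example. It assumes the model returns one row of softmax probabilities per image, as a Keras `predict` call does; the probability values and labels below are made up for the illustration and are not results from this project.

```python
import numpy as np

def predicted_classes(probabilities):
    """Pick the most probable class index for each row of probabilities."""
    return np.argmax(probabilities, axis=1)

def accuracy(probabilities, true_labels):
    """Fraction of rows whose predicted class matches the true label."""
    matches = predicted_classes(probabilities) == np.asarray(true_labels)
    return float(np.mean(matches))

# Illustrative probabilities for 4 images and 3 classes (not project data).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = [0, 1, 2, 1]

print(predicted_classes(probs))  # [0 1 2 0]
print(accuracy(probs, labels))   # 0.75
```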

  5. Making predictions and model results:
     Fig 4.2 - Model accuracy curve and model loss curve.
     The curves show that the model starts to overfit after about 3 epochs. To overcome this overfitting we retrained the model with transfer learning, which stores the knowledge gained while solving one problem and applies it to a different but related problem.
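One way to read overfitting off the loss curves is to find the epoch with the lowest validation loss and check whether training loss keeps falling afterwards while validation loss rises. The sketch below does this on made-up loss values shaped like that pattern; the numbers are illustrative only and the helper names are hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1

def overfitting_epochs(train_losses, val_losses):
    """List the 1-based epochs after the best one where training loss is
    lower but validation loss is higher than at the best epoch."""
    best = best_epoch(val_losses)
    return [
        epoch
        for epoch in range(best + 1, len(val_losses) + 1)
        if train_losses[epoch - 1] < train_losses[best - 1]
        and val_losses[epoch - 1] > val_losses[best - 1]
    ]

# Illustrative curves (not project results): validation loss bottoms out at
# epoch 3 while training loss keeps decreasing.
train_loss = [1.10, 0.70, 0.45, 0.30, 0.20, 0.12]
val_loss = [1.05, 0.80, 0.72, 0.78, 0.90, 1.02]

print(best_epoch(val_loss))                      # 3
print(overfitting_epochs(train_loss, val_loss))  # [4, 5, 6]
```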

EXPERIMENTAL RESULT AND ANALYSIS WITH TRANSFER LEARNING

Implementation of Transfer Learning

● Transfer learning stores the knowledge gained while solving one problem and applies it to a different but related problem.

● Transfer learning has the advantage of decreasing the training time for a learning model and can result in lower generalization error.

Step 1: Import the dataset from Google Drive by mounting the drive in Google Colab, and import the following libraries:

● Keras

● Sequential from keras.models

● Dense, Dropout and Flatten from keras.layers

● to_categorical from keras.utils

● Conv2D and MaxPooling2D from keras.layers

● image from keras.preprocessing

● NumPy, pandas, matplotlib and tqdm

Step 2: Store the images at the target size. Our dataset has 10868 images, and we load all of them at target_size = (32,32,3).
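Loading images at a fixed target size might look like the following sketch, which uses `load_img` and `img_to_array` from the Keras preprocessing module. The helper name `load_images`, the scaling of pixel values to the range 0 to 1, and the two synthetic images written to a temporary folder are assumptions for the demonstration; Pillow must be installed for image loading to work.

```python
import os
import tempfile

import numpy as np
from PIL import Image
from tensorflow.keras.preprocessing import image

TARGET_SIZE = (32, 32)  # height and width; color_mode="rgb" adds 3 channels

def load_images(paths, target_size=TARGET_SIZE):
    """Load image files into one array of shape (n, 32, 32, 3)."""
    arrays = []
    for path in paths:
        img = image.load_img(path, target_size=target_size, color_mode="rgb")
        # Scaling to [0, 1] is an assumption; the text does not mention it.
        arrays.append(image.img_to_array(img) / 255.0)
    return np.stack(arrays)

# Demo: write two synthetic grayscale images, then load them back resized.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        path = os.path.join(tmp, f"sample_{i}.png")
        Image.fromarray(np.full((64, 48), 40 * i, dtype=np.uint8)).save(path)
        paths.append(path)
    batch = load_images(paths)

print(batch.shape)  # (2, 32, 32, 3)
```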

Step 3: Image data augmentation: Image data augmentation artificially expands the size of a training dataset by creating modified versions of the images in it. Training deep learning models on more data can produce more skillful models, and the augmented variations can improve the ability of the fitted models to generalize what they have learned to new images. The Keras deep learning library supports fitting models with image data augmentation through the ImageDataGenerator class. We use shift, flip, brightness, and zoom augmentations.
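A generator configured for shift, flip, brightness, and zoom augmentation can be sketched as below. The specific ranges (0.1, 0.2, and the 0.8 to 1.2 brightness interval), the choice of a horizontal flip, and the random stand-in images are assumptions; the text names only the kinds of augmentation.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# All numeric ranges below are illustrative assumptions.
datagen = ImageDataGenerator(
    width_shift_range=0.1,        # shift horizontally by up to 10% of width
    height_shift_range=0.1,       # shift vertically by up to 10% of height
    horizontal_flip=True,         # randomly mirror images left to right
    brightness_range=(0.8, 1.2),  # randomly darken or brighten
    zoom_range=0.2,               # zoom in or out by up to 20%
)

# Random stand-in images with pixel values in 0..255 and integer labels.
rng = np.random.default_rng(0)
x = (rng.random((8, 32, 32, 3)) * 255).astype("float32")
y = np.arange(8) % 9

# flow() returns an iterator that yields augmented (images, labels) batches.
batches = datagen.flow(x, y, batch_size=4, shuffle=False)
```

The iterator can be passed directly to `model.fit`. Drawing augmented batches relies on SciPy (for shifts and zoom) and Pillow (for brightness changes) being installed.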

Step 4: Adding the dense layers with their activations: We add dense layers and their activation functions, initialize the hyperparameters, train the model, and plot the training and validation loss and accuracy. We trained the model for 80 epochs.
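Adding dense layers on top of a pre-trained network can be sketched as follows. The text does not name the pre-trained base, so ResNet50 is used here purely as an assumption (the reference list cites the residual learning paper); the 256-unit hidden layer, the dropout rate, and the frozen base are likewise illustrative choices rather than the project's settings.

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model

def build_transfer_model(input_shape=(32, 32, 3), num_classes=9,
                         weights="imagenet"):
    """Hypothetical transfer model: frozen ResNet50 base plus dense head."""
    base = ResNet50(weights=weights, include_top=False,
                    input_shape=input_shape)
    base.trainable = False  # keep the transferred features fixed (assumption)

    inputs = Input(shape=input_shape)
    x = base(inputs, training=False)
    x = Flatten()(x)
    x = Dense(256, activation="relu")(x)  # layer size is an assumption
    x = Dropout(0.5)(x)
    outputs = Dense(num_classes, activation="softmax")(x)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# weights=None skips the download so the sketch runs offline; actual
# transfer learning needs pre-trained weights such as "imagenet".
transfer_model = build_transfer_model(weights=None)
print(transfer_model.output_shape)  # (None, 9)
```

With the base frozen, only the two dense layers are updated during training, which is one reason transfer learning can shorten training time.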

After 80 epochs, the model accuracy is around 0.9053 and the model loss is around 0.34.

CONCLUSIONS

This project proposes a detection method of malicious code visualization based on a CNN with transfer learning. We convert the PE files of the malicious samples into binary files of static disassembly. Using computer vision techniques, we map these binary files into corresponding grayscale images of the malicious code. Image texture differs greatly between different families of malicious code, while malicious code from the same family is highly similar in image texture. Therefore, we adopt code visualization technology to display malicious code samples in the form of grayscale images. We pre-train the CNN model, then transfer the model and build a loss function. We train this model using 10868 malware samples from 9 families provided by Microsoft Corporation. The experimental results show that the improved method can accelerate the model's convergence and achieve 90.53% accuracy. In future work, we plan to study optimization algorithms for deep feature analysis in order to speed up the computation of deep learning models.

REFERENCES

● Yuntao Zhao, Wenjie Cui, Shengnan Geng, Bo Bo, Yongxin Feng, and Wenbo Zhang. "A Malware Detection Method of Code Texture Visualization Based on an Improved Faster RCNN Combining Transfer Learning". Computing in Science and Engineering. IEEE.

● Deep Learning Specialization by Andrew Ng, www.coursera.org

● “Python Documentation” www.python.org

● “Python Modules Documentation” www.pypi.org

● Analytics Vidhya and Analytics India Magazine blogs on Convolutional Neural Networks and Transfer Learning.

● NPTEL, Deep Learning by Mitesh Khapra , IITM.

● Research Paper - Deep Residual Learning for Image Recognition