Malware Communication Detection Based on Deep Learning

Description

This repository contains the open-source code for our paper, "Deep Learning Hierarchical Representation from Heterogeneous Flow-level Communication Data", which is under review at IEEE Transactions on Information Forensics and Security (TIFS).

We present an alternative to manual feature engineering and show that it can replicate and optimize the key steps of that process, learning hierarchical representations of communication behavior from heterogeneous communication data. The approach consists of two steps. First, fixed-size encodings are extracted from the communication data using the idea of spatial pyramid pooling (SPP), which preserves the spatiotemporal characteristics of the data. This makes it possible to apply deep learning to heterogeneous communication data of varying size. Second, a convolutional neural network (CNN) is used as a feature extractor that automatically learns hierarchical and robust representations of communication behavior without expert knowledge.
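The fixed-size encoding idea can be illustrated with a minimal sketch. It is not the repository's encoder: the function name `spp_encode`, the pyramid levels (1, 2, 4), and the use of max pooling are assumptions chosen for illustration, and the actual 1134-dimensional layout produced by com.sgl.spp.SPPNetEncoding is not reproduced here.

```python
from typing import List, Sequence


def spp_encode(flows: Sequence[Sequence[float]], levels=(1, 2, 4)) -> List[float]:
    """Pool a variable-length, time-ordered list of flow feature vectors
    into a fixed-size vector (hypothetical helper, for illustration only).

    For each pyramid level n, the sequence is cut into n contiguous bins and
    every feature is max-pooled inside each bin. The output length is
    n_features * sum(levels), regardless of how many flows are given.
    """
    if not flows:
        raise ValueError("need at least one flow")
    n = len(flows)
    n_features = len(flows[0])
    encoded: List[float] = []
    for bins in levels:
        for b in range(bins):
            start = (b * n) // bins            # floor of the left boundary
            end = -(-(b + 1) * n // bins)      # ceiling of the right boundary
            window = flows[start:end]          # never empty, even when n < bins
            for f in range(n_features):
                encoded.append(max(row[f] for row in window))
    return encoded


# A pair with one flow and a pair with five flows yield vectors of equal length.
short = spp_encode([[1.0, 5.0]])
long = spp_encode([[1.0, 5.0], [3.0, 2.0], [0.0, 9.0], [4.0, 4.0], [2.0, 1.0]])
```

Both `short` and `long` contain 2 * (1 + 2 + 4) = 14 values, which is what allows a fixed-input CNN to consume communication pairs with different numbers of flows.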

Dataset

The experiments use the CTU-Malware dataset as an example. The dataset consists of hundreds of captures (called scenarios) of malware samples (e.g., Neris, Rbot, Virut, DonBot, Nsis) and includes both malware and normal samples. It can be downloaded from https://mcfp.felk.cvut.cz/publicDatasets/.

How to run the code?

The user guide is presented as follows.


1. Data preprocessing (Implemented based on Scala and Java)

a. Download the original CTU Malware Dataset (As listed in “CTU_Malware_Dataset_Urls.txt”) into the [Orignal_Dir]

Tips: [Orignal_Dir] is a folder for storing sample files.

b. Reformat each flow record in the original CTU Malware Dataset into the specified data format.

The format is: “SrcIP,DstIP,SrcPort,DstPort,Proto,SendPacketCount,ReceivePacketCount,PacketCount,SendLength,ReceiveLength,TotalLength,Time,FirstPacketArrive,LastPacketArrive,Duration,Label”.

Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_Reformat [Orignal_Dir] [Reformat_Dir]

Tips: [Reformat_Dir] is a folder for storing the reformatted files.
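As a sketch of how the record format in step b can be consumed, the snippet below reads reformatted lines into dictionaries keyed by the field names listed above. It assumes the files are plain comma-separated text without a header row; the function name `read_flows` and the sample line are hypothetical, and all values are left as strings because the README does not specify their types or units.

```python
import csv
import io

# Field names copied from the format string in step b.
FIELDS = [
    "SrcIP", "DstIP", "SrcPort", "DstPort", "Proto",
    "SendPacketCount", "ReceivePacketCount", "PacketCount",
    "SendLength", "ReceiveLength", "TotalLength",
    "Time", "FirstPacketArrive", "LastPacketArrive", "Duration", "Label",
]


def read_flows(text):
    """Parse reformatted flow records into dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    return list(reader)


# Made-up record, only to show the field order.
sample = "10.0.0.1,10.0.0.2,1234,80,tcp,3,2,5,300,200,500,0,0.0,1.5,1.5,Malware\n"
rows = read_flows(sample)
```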

c. Aggregate flows into communication pairs.

Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_SplitSample [Reformat_Dir] [CommPair_Dir]

Tips: [CommPair_Dir] is a folder for storing the files of communication pairs.
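The aggregation in step c can be pictured with the following sketch. It assumes that a communication pair is identified by the unordered pair of endpoint IP addresses; the README does not state the exact key used by Preprocess_SplitSample, so this definition and the function name `group_by_comm_pair` are assumptions.

```python
from collections import defaultdict


def group_by_comm_pair(flows):
    """Group flow dicts by the unordered (SrcIP, DstIP) pair.

    Assumption: both directions of a conversation belong to the same pair.
    """
    pairs = defaultdict(list)
    for flow in flows:
        key = tuple(sorted((flow["SrcIP"], flow["DstIP"])))
        pairs[key].append(flow)
    return dict(pairs)


# Made-up flows: two between the same hosts (one per direction), one to a third host.
flows = [
    {"SrcIP": "10.0.0.1", "DstIP": "10.0.0.2"},
    {"SrcIP": "10.0.0.2", "DstIP": "10.0.0.1"},
    {"SrcIP": "10.0.0.1", "DstIP": "10.0.0.3"},
]
grouped = group_by_comm_pair(flows)
```

Here `grouped` holds two pairs, and the pair ("10.0.0.1", "10.0.0.2") contains two flows.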

d. Count the flow records in each CommPair file and rename the file accordingly.

The number of flows is recorded in the filename so that the SPP-encoding program (com.sgl.spp.SPPNetEncoding) can filter out CommPair files that contain too few flow records based on the filename alone.

Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_SplitSample_CountInFilename [CommPair_Dir]

2. SPP-encoding (Implemented based on Scala and Java)

a. Encode the CommPair files into 1134-dimensional data.

Command: java -cp CTU_SPP.jar com.sgl.spp.SPPNetEncoding [CommPair_Dir] [Label] [SPP_File]

Tips: [Label] indicates the type (Normal or Malware) of CommPair files under the current folder ([CommPair_Dir]). [SPP_File] indicates the file that stores the encoding result.

b. Normalize the encoded data.

Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_Normalization_ZScore [SPP_Dir] 1134 calParam; java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_Normalization_ZScore [SPP_Dir] 1134 doNormalization

Tips: [SPP_Dir] is a folder for storing SPP_Files. For the program to run correctly, the filename of each [SPP_File] in [SPP_Dir] must start with "with_label_".
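The two invocations in step b suggest a two-pass z-score normalization: one pass to compute the parameters and one pass to apply them. The sketch below illustrates that idea only. The function names are hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions; the README does not describe how Preprocess_Normalization_ZScore treats these cases.

```python
import math


def cal_param(rows):
    """Pass 1: per-column mean and population standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n)
        for d in range(dims)
    ]
    return means, stds


def do_normalization(rows, means, stds):
    """Pass 2: apply (x - mean) / std; constant columns are mapped to 0.0."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(r, means, stds)]
        for r in rows
    ]


# Toy data with 2 columns instead of 1134.
rows = [[1.0, 5.0], [3.0, 5.0]]
means, stds = cal_param(rows)
normalized = do_normalization(rows, means, stds)
```

Computing the parameters once and reusing them keeps every SPP_File on the same scale.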


3. Extracting the features from the SPP_Files (Implemented based on Keras and TensorFlow)

Command: python ExtractFeature2CSV.py [k] [K] [Malware_Sample_File] [Normal_Sample_File]

Tips: [K] is the number of folds in K-fold cross-validation, and [k] selects the k-th fold.
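The roles of [k] and [K] can be sketched as follows. This is not taken from ExtractFeature2CSV.py: the function name `kth_fold`, the 0-based fold index, and the contiguous, unshuffled split are all assumptions made for illustration.

```python
def kth_fold(n_samples, k, K):
    """Return (train_idx, test_idx) for fold k (0-based) of K contiguous folds."""
    if not 0 <= k < K:
        raise ValueError("k must satisfy 0 <= k < K")
    start = (k * n_samples) // K
    end = ((k + 1) * n_samples) // K
    test_idx = list(range(start, end))
    train_idx = list(range(0, start)) + list(range(end, n_samples))
    return train_idx, test_idx


# 10 samples, 5 folds: fold 1 holds out samples 2 and 3.
train_idx, test_idx = kth_fold(10, 1, 5)
```

Running the script once for each value of [k] would then cover every sample exactly once as test data.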