This repository contains the open-source code for our paper, "Deep Learning Hierarchical Representation from Heterogeneous Flow-level Communication Data", which is under review at IEEE Transactions on Information Forensics and Security (TIFS).
We present an alternative to manual feature engineering and show that it can replicate and optimize the key steps of feature engineering while learning hierarchical representations of communication behavior from heterogeneous communication data. The approach consists of two steps. First, fixed-size encoded data are extracted using the idea of spatial pyramid pooling (SPP), which preserves the spatiotemporal characteristics of the communication data and allows deep learning to be applied to heterogeneous communication data. Second, a convolutional neural network (CNN) is used to build a feature extractor that automatically learns hierarchical and robust representations of communication behaviors without expert knowledge.
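To make the fixed-size encoding idea concrete, here is a minimal Python sketch of pyramid pooling over a variable-length, time-ordered sequence of flows. It is not the repository's com.sgl.spp.SPPNetEncoding implementation: the pyramid levels (1, 2, 4), the use of max pooling, and the function name `spp_encode` are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def spp_encode(flows: Sequence[Sequence[float]], levels=(1, 2, 4)) -> List[float]:
    """Pool a variable-length list of per-flow feature vectors into a fixed-size vector.

    For each pyramid level n, the sequence is split into n contiguous bins
    (floor for the bin start, ceiling for the bin end, so no bin is empty)
    and every feature is max-pooled inside each bin.
    The output length is num_features * sum(levels), whatever len(flows) is.
    """
    if not flows:
        raise ValueError("need at least one flow")
    num_flows = len(flows)
    num_features = len(flows[0])
    encoded: List[float] = []
    for n in levels:
        for b in range(n):
            start = (b * num_flows) // n                # floor
            end = ((b + 1) * num_flows + n - 1) // n    # ceiling
            window = flows[start:end]
            for j in range(num_features):
                encoded.append(max(flow[j] for flow in window))
    return encoded
```

With two features per flow and levels (1, 2, 4), a pair with 3 flows and a pair with 300 flows both yield a 14-value vector, which is what lets a fixed-input CNN consume them.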
The experiments here use the CTU-Malware dataset as an example. The CTU-Malware dataset consists of hundreds of captures (called scenarios) of malware samples (e.g., Neris, Rbot, Virut, DonBot, Nsis). It includes both malware and normal samples. The dataset can be downloaded from https://mcfp.felk.cvut.cz/publicDatasets/.
The user guide is presented as follows.
a. Download the original CTU Malware Dataset (as listed in “CTU_Malware_Dataset_Urls.txt”) into the [Orignal_Dir].
Tips: [Orignal_Dir] is a folder for storing sample files.
The format is: “SrcIP,DstIP,SrcPort,DstPort,Proto,SendPacketCount,ReceivePacketCount,PacketCount,SendLength,ReceiveLength,TotalLength,Time,FirstPacketArrive,LastPacketArrive,Duration,Label”.
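As a reading aid for the record format above, the following Python sketch parses lines with those 16 fields into dictionaries. It assumes the files are comma-separated with no header row, and it keeps every value as a string because the field types are not stated here. The name `parse_flows` and the sample record are made up for illustration.

```python
import csv
import io

# Field order taken from the format string in this guide.
FIELDS = [
    "SrcIP", "DstIP", "SrcPort", "DstPort", "Proto",
    "SendPacketCount", "ReceivePacketCount", "PacketCount",
    "SendLength", "ReceiveLength", "TotalLength", "Time",
    "FirstPacketArrive", "LastPacketArrive", "Duration", "Label",
]

# Made-up record, only to show the shape of one line.
SAMPLE = "10.0.0.1,10.0.0.2,1025,80,TCP,3,2,5,180,120,300,0,0,1,1,Malware"


def parse_flows(text):
    """Parse comma-separated flow records into a list of dicts keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    rows = []
    for row in reader:
        # DictReader stores surplus values under the key None and fills
        # missing values with None; treat both as a malformed record.
        if None in row or None in row.values():
            raise ValueError("expected %d fields per record" % len(FIELDS))
        rows.append(row)
    return rows
```

For example, `parse_flows(SAMPLE)[0]["Label"]` gives `"Malware"` for the made-up record.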
Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_Reformat [Orignal_Dir] [Reformat_Dir]
Tips: [Reformat_Dir] is a folder for storing the reformatted files.
Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_SplitSample [Reformat_Dir] [CommPair_Dir]
Tips: [CommPair_Dir] is a folder for storing the files of communication pairs.
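The splitting step can be pictured as grouping flows by their endpoints. The sketch below is an illustration, not the logic of Preprocess_SplitSample: it assumes a communication pair is identified by the ordered (SrcIP, DstIP) tuple, which this guide does not specify, and the function name `split_by_comm_pair` is hypothetical.

```python
from collections import defaultdict


def split_by_comm_pair(flows):
    """Group flow records (dicts with 'SrcIP' and 'DstIP') by communication pair.

    Assumption: a pair is the ordered tuple (SrcIP, DstIP); the real
    preprocessing step may define pairs differently.
    """
    pairs = defaultdict(list)
    for flow in flows:
        pairs[(flow["SrcIP"], flow["DstIP"])].append(flow)
    return dict(pairs)
```

Each resulting group would correspond to one file in [CommPair_Dir] under this assumption.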
The number of flows is reflected in the filename so that the SPP-encoding program (com.sgl.spp.SPPNetEncoding) can filter out CommPair files that contain too few samples, based on the filename alone.
Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_SplitSample_CountInFilename [CommPair_Dir]
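Filtering on a count embedded in the filename might look like the sketch below. The actual naming scheme produced by Preprocess_SplitSample_CountInFilename is not documented here, so the pattern (a trailing `_<count>` before the extension), the example filenames, and the `min_flows` threshold are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical convention: "<anything>_<count>.<ext>", e.g. "pairA_37.txt".
_COUNT_RE = re.compile(r"_(\d+)$")


def flow_count_from_name(filename):
    """Return the flow count encoded at the end of the file stem, or None."""
    match = _COUNT_RE.search(Path(filename).stem)
    return int(match.group(1)) if match else None


def keep_files(filenames, min_flows):
    """Keep only the files whose encoded flow count is at least min_flows."""
    kept = []
    for name in filenames:
        count = flow_count_from_name(name)
        if count is not None and count >= min_flows:
            kept.append(name)
    return kept
```

The benefit of this design is that small files can be skipped without opening them.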
Command: java -cp CTU_SPP.jar com.sgl.spp.SPPNetEncoding [CommPair_Dir] [Label] [SPP_File]
Tips: [Label] indicates the type (Normal or Malware) of CommPair files under the current folder ([CommPair_Dir]). [SPP_File] indicates the file that stores the encoding result.
Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_Normalization_ZScore [SPP_Dir] 1134 calParam
Command: java -cp CTU_SPP.jar com.sgl.preprocessing.Preprocess_Normalization_ZScore [SPP_Dir] 1134 doNormalization
Tips: [SPP_Dir] is a folder for storing SPP_Files. For the program to run correctly, the filename of each [SPP_File] in [SPP_Dir] must start with "with_label_".
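The two modes, calParam and doNormalization, suggest a two-pass z-score: first compute per-column statistics, then apply them. The Python sketch below illustrates that general technique only. The function names, the use of the population standard deviation, and the choice to map constant columns to 0 are assumptions, not a description of Preprocess_Normalization_ZScore.

```python
import math


def cal_param(rows):
    """Pass 1: compute the per-column mean and population standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    return means, stds


def do_normalization(rows, means, stds):
    """Pass 2: z-score each value; columns with zero variance become 0.0."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

Separating the passes means the statistics can be computed once and reused on other files, so that all samples are scaled with the same parameters.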
Command: python ExtractFeature2CSV.py [k] [K] [Malware_Sample_File] [Normal_Sample_File]
Tips: [K] is the number of folds in K-fold cross-validation. [k] selects the k-th fold.
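For readers unfamiliar with the [k] and [K] arguments, this sketch shows one common way to derive the train and test indices of the k-th fold. It does not reproduce ExtractFeature2CSV.py: the 0-based fold index, the interleaved assignment without shuffling, and the name `kfold_split` are assumptions.

```python
def kfold_split(num_samples, k, K):
    """Return (train_indices, test_indices) for fold k of K.

    Assumptions: k is 0-based, and sample i belongs to fold i % K.
    """
    if K < 2:
        raise ValueError("K must be at least 2")
    if not 0 <= k < K:
        raise ValueError("k must satisfy 0 <= k < K")
    test_idx = [i for i in range(num_samples) if i % K == k]
    train_idx = [i for i in range(num_samples) if i % K != k]
    return train_idx, test_idx
```

Running the script once for each value of [k] covers every sample exactly once as test data under this scheme.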