Introduction

The whole experiment is based on a sample graph dataset drawn from CERT4.2. The sampled user-day nodes and the train-test split are the same as in our previous work (B-CITD). The code and data here consist of three parts:

0. Sequence pre-processing

All pre-processing of the sequential activities for user-days is contained in the code_for_sequence_process directory. To try this procedure, download the four log files ('logon.csv', 'device.csv', 'file.csv', 'email.csv') from the original CERT4.2 dataset; they are too large to upload here. Alternatively, the files can be downloaded from our personal cloud link here. Put the downloaded .csv files in the cert4.2_data directory and run the two .py files one after the other. This generates the activity sequence for each user-day sample in our sample dataset and stores it in the sample_data directory, where it serves as the input for LSTM feature extraction in the next stage.

This directory is only for trial and demonstration. It is not needed for the later steps, since all the required sample data is already included in a separate directory.
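For readers who want a feel for what the pre-processing step produces, the following is a minimal sketch, not the code in code_for_sequence_process. It groups log events by user and calendar day and orders them by time. The column names ('date', 'user', 'activity'), the timestamp format, the activity codes in ACTIVITY_CODES, and the helper names read_events and build_user_day_sequences are all assumptions made for illustration; the actual scripts may encode activities differently.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical integer codes for activity types (assumption, not the repository's encoding).
ACTIVITY_CODES = {"Logon": 1, "Logoff": 2, "Connect": 3, "Disconnect": 4,
                  "file": 5, "email": 6}


def read_events(handle, source):
    """Yield (user, timestamp, code) tuples from one CSV log.

    Assumes 'date' and 'user' columns and a MM/DD/YYYY HH:MM:SS timestamp.
    Logs without an 'activity' column fall back to the source name.
    """
    for row in csv.DictReader(handle):
        ts = datetime.strptime(row["date"], "%m/%d/%Y %H:%M:%S")
        label = row.get("activity") or source
        yield row["user"], ts, ACTIVITY_CODES[label]


def build_user_day_sequences(event_streams):
    """Merge several event streams into one time-ordered sequence per user-day."""
    buckets = defaultdict(list)
    for events in event_streams:
        for user, ts, code in events:
            buckets[(user, ts.date().isoformat())].append((ts, code))
    # Sort each bucket by timestamp, then keep only the activity codes.
    return {key: [code for _, code in sorted(items)]
            for key, items in buckets.items()}


# Tiny in-memory stand-ins for logon.csv and device.csv (made-up rows).
logon_csv = io.StringIO(
    "id,date,user,pc,activity\n"
    "a1,01/04/2010 08:00:00,ABC0001,PC-1,Logon\n"
    "a2,01/04/2010 17:00:00,ABC0001,PC-1,Logoff\n"
)
device_csv = io.StringIO(
    "id,date,user,pc,activity\n"
    "b1,01/04/2010 09:30:00,ABC0001,PC-1,Connect\n"
)

sequences = build_user_day_sequences([
    read_events(logon_csv, "logon"),
    read_events(device_csv, "device"),
])
print(sequences)
```

With real data, the io.StringIO objects would be replaced by open file handles on the .csv files placed in cert4.2_data.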

1. Use LSTM auto-encoder to extract features from sequential activity

To try this part, one only needs the sample_data directory from the main branch. Run lstm_feature_extraction.py; after training, a new directory called lstm_feature_extraction_result is generated to store the result. The sequence length can be set inside the program.
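A sequence-length setting usually means that every user-day sequence is brought to the same length before it is fed to the LSTM. The sketch below shows one common way to do this, truncating long sequences and right-padding short ones. It is an illustration under stated assumptions, not the logic of lstm_feature_extraction.py: the names SEQ_LEN, PAD and fit_length, the padding value 0, and the choice to keep the start of each sequence are all hypothetical.

```python
SEQ_LEN = 8  # hypothetical fixed length; the real value is set inside the program
PAD = 0      # hypothetical padding code, assumed not to collide with an activity code


def fit_length(seq, seq_len=SEQ_LEN, pad=PAD):
    """Return a copy of seq with exactly seq_len items.

    Longer sequences keep their first seq_len items; shorter ones are
    right-padded with the pad value.
    """
    clipped = list(seq[:seq_len])
    return clipped + [pad] * (seq_len - len(clipped))


short = fit_length([1, 3, 2], seq_len=5)
long = fit_length([1, 2, 3, 4, 5, 6, 7], seq_len=4)
print(short, long)
```

Keeping the start of the day rather than the end is a design choice; if late-day activity matters more, slicing with seq[-seq_len:] would keep the tail instead.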

2. Detection with ResHybnet model

To try this part, one needs the sample_data directory from the main branch. Run detection_with_ResHybnet.py; after training, a new directory called detection_result is generated to store the result. Different model settings and feature combinations can be set inside the program.