Classification of Darknet traffic samples using neural networks.

Darknet traffic classification

This MATLAB project uses two different neural networks to classify Darknet traffic samples into three classes: Tor, VPN and Benign (Non-Tor + NonVPN).

Dataset

The datasets directory contains two datasets in CSV format:

  • Darknet.csv - also known as CIC-Darknet2020
  • Darknet_preprocessed.csv - preprocessed dataset

To generate the preprocessed dataset and print the label distribution, run data_preprocessing/main.m.
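
As an illustration of the label handling described above, the following minimal sketch merges the Non-Tor and NonVPN labels into a single Benign class and counts the samples per class. The label vector is made up, and the script does not reproduce data_preprocessing/main.m; it only shows one way this step could be done with MATLAB's categorical functions.

```matlab
% Made-up label column using the four label names of the raw dataset.
labels = categorical({'Non-Tor'; 'NonVPN'; 'Tor'; 'VPN'; ...
                      'Non-Tor'; 'NonVPN'; 'VPN'});

% Merge the two benign labels into one class (three classes remain).
labels = mergecats(labels, {'Non-Tor', 'NonVPN'}, 'Benign');

% Label distribution: class names and the number of samples in each.
names  = categories(labels);
counts = countcats(labels);

for i = 1:numel(names)
    fprintf('%-7s %d\n', names{i}, counts(i));
end
```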

Selected features

From the dataset, only 28 features were considered:

| Average Packet Size | FIN Flag Count | Fwd Packets/s | Packet Length Mean |
| Bwd Init Win Bytes | Flow Duration | Fwd Segment Size Avg | Packet Length Std |
| Bwd Packet Length Max | Flow IAT Max | Fwd Seg Size Min | Packet Length Variance |
| Bwd Packet Length Mean | Flow IAT Mean | Idle Max | Protocol |
| Bwd Packet Length Min | Flow IAT Min | Idle Mean | Subflow Bwd Bytes |
| Bwd Packets/s | Fwd Header Length | Idle Min | Subflow Fwd Packets |
| Bwd Segment Size Avg | FWD Init Win Bytes | Packet Length Max | Total Length of Bwd Packet |

Data normalization

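Data was normalized using the Z-score, which expresses each value as its distance from the feature mean in units of standard deviation and preserves the shape of the original distribution. Z-scoring gives every feature zero mean and unit standard deviation; on its own it does not strictly confine values to [-1, 1], so that range holds only for values within one standard deviation of the mean unless a further rescaling step is applied.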
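
A minimal sketch of per-feature Z-score normalization in base MATLAB is shown below. The matrix X is made up for illustration; the toolbox function zscore or the base function normalize would give the same result and may be what the project actually calls.

```matlab
% Rows are samples, columns are features (values are made up).
X = [10 200 0.5;
     20 400 1.5;
     30 600 2.5;
     40 800 3.5];

mu    = mean(X, 1);          % per-feature mean
sigma = std(X, 0, 1);        % per-feature sample standard deviation
sigma(sigma == 0) = 1;       % guard against constant features

Z = (X - mu) ./ sigma;       % implicit expansion (R2016b or later)

disp(mean(Z, 1));            % approximately 0 for every feature
disp(std(Z, 0, 1));          % 1 for every feature
disp([min(Z(:)) max(Z(:))]); % the extremes lie outside [-1, 1] here
```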

Data balancing

Because Tor samples are notably scarce compared with other types of traffic, data augmentation was performed using the SMOTE (Synthetic Minority Over-sampling Technique) function. This yields a more representative dataset, with diverse and abundant samples from each class, and helps avoid overfitting.
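
The idea behind SMOTE can be sketched in a few lines of base MATLAB: each synthetic sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours. This is a hand-written illustration, not the SMOTE function used by the project; the data and the values of k and nNew are assumptions.

```matlab
rng(0);                                   % reproducible illustration
Xmin = [1.0 1.0; 1.2 0.9; 0.8 1.1; 1.1 1.3; 0.9 0.7];  % made-up minority samples
k    = 3;                                 % neighbours per sample (assumed)
nNew = 10;                                % synthetic samples to create (assumed)

n = size(Xmin, 1);

% Pairwise squared Euclidean distances between minority samples.
D = zeros(n);
for i = 1:n
    D(i, :) = sum((Xmin - Xmin(i, :)).^2, 2)';
end
D(1:n+1:end) = Inf;                       % a point is not its own neighbour
[~, order] = sort(D, 2);                  % nearest neighbours first
nbrs = order(:, 1:k);

Xsyn = zeros(nNew, size(Xmin, 2));
for s = 1:nNew
    i   = randi(n);                       % pick a minority sample
    j   = nbrs(i, randi(k));              % pick one of its k nearest neighbours
    gap = rand;                           % interpolation factor in (0, 1)
    Xsyn(s, :) = Xmin(i, :) + gap * (Xmin(j, :) - Xmin(i, :));
end

Xaug = [Xmin; Xsyn];                      % original plus synthetic samples
```

Because every synthetic point is an interpolation between two real minority samples, the new samples stay inside the region already covered by the minority class.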

Data splitting

Data was divided into three subsets:

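| Subset | Non-Tor | NonVPN | VPN | Tor |
|---|---|---|---|---|
| Training (60%) | 56,014 | 14,318 | 13,751 | 8,352 |
| Validation (20%) | 18,671 | 4,773 | 4,584 | 2,784 |
| Testing (20%) | 18,671 | 4,772 | 4,584 | 2,784 |
| Total | 93,356 | 23,863 | 22,919 | 13,920 |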
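
The sketch below shows one way to obtain a stratified 60/20/20 split in base MATLAB. It uses the class totals from the table, rounds the training and validation sizes, and assigns the remainder to testing. This rounding scheme reproduces the counts in the table, but whether the project splits the data this way is an assumption.

```matlab
rng(0);                                      % reproducible shuffle
totals = [93356 23863 22919 13920];          % Non-Tor, NonVPN, VPN, Tor (from the table)
labels = repelem((1:4)', totals');           % stand-in label vector, one entry per sample
subset = zeros(numel(labels), 1);            % 1 = training, 2 = validation, 3 = testing

for c = 1:numel(totals)
    idx = find(labels == c);
    idx = idx(randperm(numel(idx)));         % shuffle within the class
    nTrain = round(0.6 * numel(idx));
    nVal   = round(0.2 * numel(idx));
    subset(idx(1:nTrain))             = 1;
    subset(idx(nTrain+1:nTrain+nVal)) = 2;
    subset(idx(nTrain+nVal+1:end))    = 3;   % remainder goes to testing
end

% Rows: training, validation, testing. Columns: the four labels.
counts = accumarray([subset labels], 1);
disp(counts);
```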

Models

Multilayer Perceptron (MLP)

The MLP comprises an input layer with 28 nodes, one per feature, two hidden layers with 5 nodes each, and an output layer with 3 nodes, one per class. The model was trained with the Levenberg-Marquardt backpropagation algorithm, using the mean squared error (MSE) as the performance function, in a parallel execution environment.
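
A network with this shape could be set up as in the sketch below, which assumes the Deep Learning Toolbox is installed. The use of patternnet, the random data and the 60/20/20 division inside the network are assumptions made for illustration; mlp/main.m may build and train the model differently, and the parallel execution setting is left out here.

```matlab
rng(0);
X = rand(28, 300);                          % made-up inputs: 28 features x 300 samples
classIdx = randi(3, 1, 300);                % made-up class indices in {1, 2, 3}
T = full(ind2vec(classIdx, 3));             % one-hot targets, 3 x 300

% Two hidden layers of 5 nodes, Levenberg-Marquardt training, MSE performance.
net = patternnet([5 5], 'trainlm', 'mse');
net.divideParam.trainRatio = 0.6;           % assumed to mirror the 60/20/20 split
net.divideParam.valRatio   = 0.2;
net.divideParam.testRatio  = 0.2;
net.trainParam.showWindow  = false;         % suppress the training window

net = train(net, X, T);

scores = net(X);                            % 3 x 300 class scores
[~, predicted] = max(scores, [], 1);        % predicted class index per sample
```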

Convolutional Neural Network (CNN)

The CNN has an input layer in which each sample is treated as a single-channel (grayscale) image with a height of 28 pixels and a width of 1 pixel. It is followed by a convolutional layer and a rectified linear unit (ReLU) activation function, applied element-wise to the output of the convolutional layer. Three fully connected layers come next: the first two have 5 neurons each, and the last has 3 neurons, corresponding to the number of output classes. A softmax layer converts the outputs of the last fully connected layer into a probability distribution over the classes, and a final classification layer assigns the predicted class based on the highest probability. The model was trained on the training subset with the Adam optimizer for a maximum of 8 epochs, with a mini-batch size of 256, in a parallel execution environment.
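
The layer stack described above could be written as in the following sketch, which assumes the Deep Learning Toolbox is installed. The convolution filter size and filter count are not given in the text and are assumptions, as are the random data and class names. The execution environment is set to 'auto' so the sketch runs without a parallel pool; 'parallel' would match the description.

```matlab
rng(0);
N = 512;
X = rand(28, 1, 1, N);                      % made-up samples as 28x1x1 "images"
Y = categorical(randi(3, N, 1), 1:3, {'Benign', 'Tor', 'VPN'});  % made-up labels

layers = [
    imageInputLayer([28 1 1])
    convolution2dLayer([3 1], 8, 'Padding', 'same')  % size and count are assumptions
    reluLayer
    fullyConnectedLayer(5)
    fullyConnectedLayer(5)
    fullyConnectedLayer(3)                  % one neuron per class
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 8, ...
    'MiniBatchSize', 256, ...
    'ExecutionEnvironment', 'auto', ...     % 'parallel' needs Parallel Computing Toolbox
    'Verbose', false);

net = trainNetwork(X, Y, layers, options);
predicted = classify(net, X);               % categorical class predictions
```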

Experiments

Running mlp/main.m and/or cnn/main.m will do the following:

  • Read datasets/Darknet_preprocessed.csv and split the data into training, validation and testing subsets.
  • Create and train the MLP/CNN model, respectively.
  • Generate the confusion matrix after feeding the testing subset.

Results

The results from the experiments are presented below in the form of confusion matrices, showcasing the performance of the Multilayer Perceptron (left) and the Convolutional Neural Network (right):

[Figures: MLP confusion matrix (left) and CNN confusion matrix (right)]

Performance

The evaluation metrics used to assess the performance of the models are shown in the table below. Precision, recall and F1 are listed with three values each, one per class:

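| Metric | Equation | MLP | CNN |
|---|---|---|---|
| Accuracy | (TP + TN)/(TP + FP + TN + FN) | 0.94 | 0.91 |
| Precision | TP/(TP + FP) | 0.87 | 0.89 |
| | | 0.86 | 0.78 |
| | | 0.97 | 0.94 |
| Recall | TP/(TP + FN) | 0.99 | 0.94 |
| | | 0.81 | 0.69 |
| | | 0.97 | 0.95 |
| F1 | 2TP/(2TP + FP + FN) | 0.93 | 0.91 |
| | | 0.83 | 0.73 |
| | | 0.97 | 0.94 |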

TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative
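
The equations above can be applied to a multi-class confusion matrix by treating each class in turn as the positive class. The sketch below does this in base MATLAB for a made-up 3x3 matrix; the counts are not the project's results. It computes both the per-class accuracy given by the equation and the overall fraction of correct predictions, since the text does not state which of the two the table reports.

```matlab
% Rows: true class, columns: predicted class (made-up counts).
C = [50  2  3;
      4 30  6;
      1  5 44];

TP = diag(C)';                     % correctly classified samples per class
FP = sum(C, 1) - TP;               % predicted as the class, belonging to another
FN = sum(C, 2)' - TP;              % belonging to the class, predicted as another
TN = sum(C(:)) - TP - FP - FN;     % everything else

precision   = TP ./ (TP + FP);
recall      = TP ./ (TP + FN);
f1          = 2 * TP ./ (2 * TP + FP + FN);
accPerClass = (TP + TN) ./ (TP + FP + TN + FN);   % one-vs-rest accuracy per class
accuracy    = sum(TP) / sum(C(:));                % overall fraction correct

disp([precision; recall; f1]);     % one column per class
disp(accuracy);
```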