botnet_active_learning

This repository contains code used for experiments in my BSc final thesis, “Multi-class Classification of Botnet Detection by Active Learning.”

Thesis in a Nutshell

Labeling malware samples and network traffic is costly in the cybersecurity industry.
Active learning reduces this cost by letting the model choose which samples should be labeled next, so effective ML models can be built from a limited amount of labeled data.
This thesis benchmarks well-known query strategies to determine which strategy and parameters achieve the best results with the fewest labeled samples.
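
The pool-based cycle described above can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the code used in the thesis: the nearest-centroid model, the toy 2D data, the class names "a"/"b"/"c", and the function names (`centroids`, `predict_proba`, `margin_sampling`, `random_sampling`, `active_learning`) are all assumptions made for the example.

```python
import math
import random

def centroids(X, y):
    """Fit a toy nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict_proba(model, x):
    """Class probabilities: softmax over negative distances to the centroids."""
    weights = {c: math.exp(-math.dist(x, mu)) for c, mu in model.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def margin_sampling(model, pool):
    """Pick the pool index whose top two class probabilities are closest."""
    def margin(x):
        top, second = sorted(predict_proba(model, x).values(), reverse=True)[:2]
        return top - second
    return min(range(len(pool)), key=lambda i: margin(pool[i]))

def random_sampling(model, pool):
    """Baseline: pick a pool index uniformly at random."""
    return random.randrange(len(pool))

def active_learning(X_lab, y_lab, pool, oracle, query, n_queries):
    """Train, query one instance, ask the oracle for its label, repeat."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(pool)
    for _ in range(n_queries):
        model = centroids(X_lab, y_lab)   # 1. train on the labeled set
        i = query(model, pool)            # 2. choose an instance to label
        x = pool.pop(i)                   # 3. remove it from the unlabeled pool
        X_lab.append(x)
        y_lab.append(oracle(x))           # 4. obtain its label from the oracle
    return centroids(X_lab, y_lab), X_lab, y_lab

# Toy data (assumption): three Gaussian blobs; the oracle knows the true centers.
random.seed(0)
centers = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (0.0, 4.0)}

def oracle(x):
    return min(centers, key=lambda c: math.dist(x, centers[c]))

pool = [(cx + random.gauss(0, 1), cy + random.gauss(0, 1))
        for cx, cy in centers.values() for _ in range(30)]

model, X_lab, y_lab = active_learning(
    list(centers.values()), list(centers), pool, oracle, margin_sampling, n_queries=10
)
print(len(X_lab))  # 3 seed samples + 10 queried samples = 13
```

Passing `random_sampling` instead of `margin_sampling` gives the random baseline that the query strategies are compared against.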

Figure 1. Cycle of Active Learning

Figure 2. Uncertainty Sampling vs. Query by Committee vs. Random Sampling

Figure 3. Ranked Batch-mode Sampling vs. Random Sampling

Conclusion

In these experiments, Margin Sampling was the best strategy in terms of stability and convergence speed.
If multiple instances must be queried in each iteration, Ranked Batch-mode Sampling with a small unlabeled pool may perform well.
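
To make the batch case concrete, here is a small sketch of ranked batch-mode selection following the commonly described formulation: each candidate is scored as alpha * (1 - similarity) + (1 - alpha) * uncertainty, with alpha = |unlabeled| / (|unlabeled| + |labeled|). Whether the thesis code uses exactly this scoring is an assumption; the function name `ranked_batch`, the similarity definition 1 / (1 + Euclidean distance to the nearest labeled point), and the toy inputs are illustrative only.

```python
import math

def ranked_batch(labeled, pool, uncertainty, n_instances):
    """Greedily pick a batch that balances uncertainty and diversity.

    labeled:     feature vectors that already have labels
    pool:        unlabeled feature vectors
    uncertainty: one precomputed uncertainty score in [0, 1] per pool item
    """
    labeled = list(labeled)
    remaining = list(range(len(pool)))
    batch = []
    for _ in range(min(n_instances, len(remaining))):
        # A smaller unlabeled pool lowers alpha and puts more weight on uncertainty.
        alpha = len(remaining) / (len(remaining) + len(labeled))

        def score(i):
            nearest = min(math.dist(pool[i], x) for x in labeled)
            similarity = 1.0 / (1.0 + nearest)
            return alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty[i]

        best = max(remaining, key=score)
        batch.append(best)
        remaining.remove(best)
        # Treat the pick as labeled so later picks are pushed away from it.
        labeled.append(pool[best])
    return batch

labeled = [(0.0, 0.0)]
pool = [(0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
uncertainty = [0.9, 0.9, 0.5]
print(ranked_batch(labeled, pool, uncertainty, n_instances=2))  # [2, 1]
```

In this toy run the distant point is picked first even though it is less uncertain, because alpha is large while the pool is large relative to the labeled set.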

Setup Instructions (for those who want to run the code)

Environment

To get started, clone the repository:

git clone https://github.com/kei5uke/botnet-active-learning.git

Then, change your current directory and install the dependencies:

cd active_learning
pip install -r requirements.txt

Next, download the MedBIoT and N-BaIoT datasets and store them in the /dataset directory.
The expected file structure is shown in that directory, so place the datasets accordingly.
You can find the datasets here:

Dataset Pickles

Only a small portion of the datasets is used for the experiments.
To generate the dataset pickles, run python3 Make_df_MedBIoT.py and python3 Make_df_N-BaIoT.py.
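
For readers unfamiliar with the idea, the sketch below shows the general pattern of subsampling rows and serializing them with pickle. It is not what the Make_df scripts do internally (those are not reproduced here and presumably build DataFrames); the column names, the 10% fraction, and the helper `sample_rows` are hypothetical.

```python
import csv
import io
import pickle
import random

def sample_rows(csv_text, fraction, seed=0):
    """Parse CSV text and return a reproducible random subset of its rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

# Hypothetical stand-in for a dataset file: 100 rows with made-up columns.
raw = "feature_1,feature_2,label\n" + "\n".join(
    f"{i},{i * 2},benign" for i in range(100)
)

subset = sample_rows(raw, fraction=0.1)

# Serialize the subset and read it back, as a later experiment script would.
blob = pickle.dumps(subset)
restored = pickle.loads(blob)
print(len(restored))  # 10
```

Fixing the seed keeps the sampled subset identical across runs, which matters when different query strategies are compared on the same data.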

Experiment

Change the common variables in global_variable.py and the shared variables in each file as needed.
You are now ready to run any of the experiment code in the /active_learning directory.