This repository contains code used for experiments in my BSc final thesis, “Multi-class Classification of Botnet Detection by Active Learning.”
The process of labeling malware samples and network traffic is a costly endeavor in the cybersecurity industry.
Active learning addresses this by building effective ML models from a limited amount of labeled data.
This thesis benchmarks well-known query strategies to determine which strategy and parameters achieve the best results with the fewest labeled samples. The main findings are:
- Margin Sampling is the optimal strategy in terms of stability and convergence speed.
- If multiple instances are required in each iteration, Ranked Batch-mode Sampling with a small unlabeled pool may perform well.
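To make the first finding concrete, here is a minimal sketch of margin sampling in plain Python. It is an illustration under stated assumptions, not the code used in the thesis: the function names and the probability values are hypothetical, and each row is assumed to hold the predicted class probabilities for one unlabeled sample (at least two classes). The strategy queries the samples whose two most probable classes are closest together.

```python
from typing import List, Sequence


def margin_scores(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Return the gap between the two most probable classes for each sample."""
    scores = []
    for row in probabilities:
        top_two = sorted(row, reverse=True)[:2]
        scores.append(top_two[0] - top_two[1])
    return scores


def margin_sampling(probabilities: Sequence[Sequence[float]],
                    n_instances: int = 1) -> List[int]:
    """Return the indices of the n_instances samples with the smallest margin."""
    scores = margin_scores(probabilities)
    # A small margin means the model is torn between two classes,
    # so those samples are the most informative ones to label next.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:n_instances]


# Hypothetical class probabilities for four unlabeled samples over three classes.
pool_probabilities = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # margin of about 0.05
    [0.50, 0.30, 0.20],  # margin of about 0.20
    [0.34, 0.33, 0.33],  # margin of about 0.01
]

query_indices = margin_sampling(pool_probabilities, n_instances=2)
print(query_indices)  # [3, 1]
```

In an active learning loop, the returned indices would be sent to an annotator, moved from the unlabeled pool to the training set, and the classifier retrained before the next query.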
To get started, clone the repository:
git clone https://github.com/kei5uke/botnet-active-learning.git
Then, change your current directory and install the dependencies:
cd active_learning
pip install -r requirements.txt
Next, download the MedBIoT and N-BaIoT datasets and store them in the /dataset directory.
The expected file structure is shown in that directory, so be sure to place the datasets accordingly.
You can find the datasets here:
Only a small portion of the datasets is used for the experiments.
To generate the dataset pickles, run:
python3 Make_df_MedBIoT.py
python3 Make_df_N-BaIoT.py
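After the scripts finish, it can help to confirm that a pickle loads before starting a long experiment. The helper below is a minimal sketch using only the standard library. The function name is hypothetical, and the output file names are not stated here, so check the Make_df scripts for the actual paths.

```python
import pickle
from pathlib import Path
from typing import Any


def load_dataset_pickle(path: Path) -> Any:
    """Load one generated pickle, failing early with a clear message if it is missing."""
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; run the matching Make_df script first")
    with path.open("rb") as handle:
        # Only unpickle files you generated yourself: pickle can execute code on load.
        return pickle.load(handle)
```

Call it with the path of a generated file, for example `load_dataset_pickle(Path("dataset") / "<name>.pkl")`, where `<name>` stands for whatever the script wrote. If the pickles hold pandas objects, which is an assumption based on the script names, pandas must be installed for loading to succeed.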
Adjust the common variables in global_variable.py and the shared variables in each file as needed.
Now you are ready to run any of the experiment code in the /active_learning directory.