/IoTGeM

IoT Attack Detection with machine learning

IoTGeM: Generalisable Models for Behaviour-Based IoT Attack Detection.

Overview

This repository contains a Python implementation of the methods described in the paper IoTGeM: Generalisable Models for Behaviour-Based IoT Attack Detection.

What is IoTGeM?

The Internet of Things (IoT) is increasingly part of our daily lives, and it is also the target of many cyber attacks. Because IoT devices are heterogeneous and have unusual interfaces, solutions designed for conventional systems may not be suitable. This paper proposes a generalisable intrusion detection model based on machine learning. To build this model, we propose a feature set based on sliding and expanding windows that can detect attacks earlier and with higher performance. We also compare our method with alternatives. In the feature selection phase, we use a genetic algorithm driven by external feedback to generate the optimal feature set combination. To demonstrate the generalisability of our results, we evaluate the models we have developed on isolated attack data.

Requirements and Infrastructure:

Wireshark and Python 3.6 were used to create the application files. Before running the files, make sure that Wireshark, Python 3.6+ and the following libraries are installed.

Library    | Task
Scapy      | Packet (pcap) crafting
tshark     | Packet (pcap) crafting
Sklearn    | Machine Learning & Data Preparation
Numpy      | Mathematical Operations
Pandas     | Data Analysis
Matplotlib | Graphics and Visuality
Seaborn    | Graphics and Visuality

The technical specifications of the computer used for experiments are given below.

Central Processing Unit : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz
Random Access Memory : 8 GB (7.74 GB usable)
Operating System : Windows 10 Pro 64-bit
Graphics Processing Unit : AMD Radeon (TM) 530

Implementation:

The implementation phase consists of 5 steps, which are:

  • Feature Extraction
  • Feature Selection
  • Algorithm Selection
  • Performance Evaluation
  • Comparison with Previous Work

Each of these steps is implemented using one or more Python files. Each file is saved with both "py" and "ipynb" extensions, and the code in the two versions is identical. The ipynb version has the advantage of storing the state and screen output of its last run, so the output can be viewed without re-running the file. Files with the ipynb extension can be run using Jupyter Notebook.

Datasets

The datasets we used in our study are listed below.

Dataset      | Capture year | Number of devices | Number of attacks
CIC-IoT-2022 | 2022         | 40                | 4
BoT-IoT      | 2018         | 5                 | 10
Edge-IIoT    | 2022         | 9                 | 13
IoT-ENV      | 2021         | 5                 | 3
IoT-NID      | 2019         | 2                 | 10
Kitsune      | 2019         | 9                 | 9
MazeBolt     | NA           | NA                | NA

We used these datasets for different purposes at different stages of our study. Table 1 shows which parts of each dataset were used and for what purpose.

Table 1 - The datasets used in the study and their usage purposes.

1- Feature Extraction (PCAP2CSV)

1.1- Individual and Window based Features

This section uses the pcap2csv tool to extract features from pcap files. Packet-level labels are required for labelling. Some datasets have packet-based labels (such as Kitsune). If you have labels, save them with the same name as the pcap file (such as example.pcap & example.csv). The pcap2csv file will transfer the labels to CSV during the feature extraction process.

If you do not have packet-level labels, the LabelMaker file will generate them from the pcap file. To do this, add the Wireshark rules for identifying attacks to the dataset_description.csv file (see Fig. 1).
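As a rough illustration of rule-based labelling, the sketch below builds a tshark command for each rule and maps the matching frame numbers to labels. It is not the LabelMaker code: the CSV layout (`label,filter`), the two example filters, and the helper names `build_tshark_command` and `label_packets` are all assumptions made for this example. Only the tshark options `-r`, `-Y`, `-T fields` and `-e` are standard.

```python
import csv
import io

# Hypothetical dataset_description.csv content: one display filter per attack label.
DESCRIPTION = """label,filter
SYN_flood,tcp.flags.syn == 1 && tcp.flags.ack == 0
ARP_traffic,arp
"""

def build_tshark_command(pcap_path, display_filter):
    """Return a tshark command that prints the frame numbers matching a filter."""
    return ["tshark", "-r", pcap_path, "-Y", display_filter,
            "-T", "fields", "-e", "frame.number"]

def label_packets(n_packets, matches_by_label, default="benign"):
    """Assign a label to every packet from per-label lists of 1-based frame numbers."""
    labels = [default] * n_packets
    for label, frames in matches_by_label.items():
        for frame in frames:
            labels[frame - 1] = label
    return labels

# Build one command per rule (the commands are not executed here).
rules = list(csv.DictReader(io.StringIO(DESCRIPTION)))
commands = {r["label"]: build_tshark_command("example.pcap", r["filter"]) for r in rules}

# Frame numbers would normally come from the tshark output; these are made up.
labels = label_packets(4, {"SYN_flood": [2, 3]})
print(labels)
```

Running each command with `subprocess.run` and parsing its output into frame numbers is left out so that the sketch works without tshark installed.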

Fig. 1 - Pcap file labelling with LabelMaker
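The difference between individual, sliding-window and expanding-window features can be sketched with pandas. This is a minimal illustration and not the pcap2csv implementation: the column names (`ts`, `size`), the window length of 3 packets, and the choice of the mean as the statistic are assumptions.

```python
import pandas as pd

# Hypothetical packet records: arrival time (seconds) and packet size (bytes).
packets = pd.DataFrame({
    "ts": [0.00, 0.01, 0.05, 0.06, 0.20],
    "size": [60, 60, 1500, 60, 1500],
})

WINDOW = 3  # assumed sliding-window length, in packets

# Individual (per-packet) feature: inter-arrival time to the previous packet.
packets["iat"] = packets["ts"].diff().fillna(0.0)

# Sliding window: statistic over the last WINDOW packets only.
packets["size_mean_sliding"] = packets["size"].rolling(WINDOW, min_periods=1).mean()

# Expanding window: statistic over every packet seen so far.
packets["size_mean_expanding"] = packets["size"].expanding().mean()

print(packets)
```

The sliding mean reacts quickly to recent changes in traffic, while the expanding mean accumulates the whole history of the capture.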

1.2- Flow-based features

For flow-based feature extraction, we used CICFlowMeter (see Fig. 2), a tool that quickly converts pcap files into flow-based features in CSV format. For labelling, most of the datasets provide their own labelled CSV files, and you can use these labels. For some datasets, we used a Python script to import the labels into these files. You can find a few examples of how we did this in the FLOW-LABELLER.ipynb file.

Fig. 2 - CICFlowMeter V3 user interface.
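One common way to import labels into a flow CSV is a left join on the fields that identify a flow. The sketch below shows the idea with pandas; it is not taken from FLOW-LABELLER.ipynb, and the column names, IP addresses and the rule that unmatched flows are benign are assumptions for illustration.

```python
import pandas as pd

# Hypothetical flow features (column names are illustrative).
flows = pd.DataFrame({
    "Src IP": ["10.0.0.2", "10.0.0.3", "10.0.0.2"],
    "Dst IP": ["10.0.0.9", "10.0.0.9", "10.0.0.7"],
    "Flow Duration": [1200, 300, 4500],
})

# Hypothetical label file shipped with a dataset: known attacker/victim pairs.
attack_pairs = pd.DataFrame({
    "Src IP": ["10.0.0.3"],
    "Dst IP": ["10.0.0.9"],
    "Label": ["DoS"],
})

# A left join keeps every flow; flows without a match are treated as benign.
labelled = flows.merge(attack_pairs, on=["Src IP", "Dst IP"], how="left")
labelled["Label"] = labelled["Label"].fillna("Benign")

print(labelled)
```

Real label files often need additional keys such as ports or timestamps to avoid labelling benign flows between the same pair of hosts as attacks.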

2- Feature Selection

In this step, we aimed to obtain the most appropriate feature set through feature selection. We first applied feature reduction and then a genetic algorithm.

2.1- Feature Reduction

02.1 Feature_reduction.ipynb: We applied a 3-step voting method for feature reduction. In this method, each feature is used on its own to train a machine learning model in 3 different data configurations. If the resulting kappa score is greater than 0, the feature receives one vote for that step. These three steps can be summarised as follows.

  • Using a cross-validated session as training and test.
  • Using one of the two sessions as test and the other as training data.
  • Using data from two different datasets as training and testing.

This process is visualised in Fig. 3.

Fig. 3 - Data for three-step feature reduction.

A feature is passed to the next step only if it receives at least two of the three votes and one of them comes from the last step. Features that do not meet this requirement are discarded. An example voting process is given in Fig. 4.

Fig. 4 - Voting process during the feature elimination step for the Host Discovery attack.
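The voting rule can be written down in a few lines. The sketch below is an illustration of the rule as described here, not the code in 02.1 Feature_reduction.ipynb; the function names, feature names and kappa values are made up. `cohen_kappa_score` is the scikit-learn function for computing kappa from true and predicted labels.

```python
from sklearn.metrics import cohen_kappa_score

def votes_for_feature(kappas):
    """One vote per step whose kappa score is greater than 0."""
    return [k > 0 for k in kappas]

def keep_feature(kappas):
    """Keep a feature with at least two votes, one of which is from the last step."""
    votes = votes_for_feature(kappas)
    return sum(votes) >= 2 and votes[-1]

# Kappa compares predictions with true labels; 0 means chance-level agreement.
example_kappa = cohen_kappa_score([0, 1, 1, 0], [0, 1, 0, 0])

# Hypothetical kappa scores per step: (cross-validation, session, dataset).
scores = {
    "feature_a": (0.40, 0.10, 0.05),   # three votes -> kept
    "feature_b": (0.30, 0.20, 0.00),   # no vote from the last step -> discarded
    "feature_c": (0.00, 0.25, 0.15),   # two votes including the last step -> kept
    "feature_d": (-0.10, 0.00, 0.30),  # only one vote -> discarded
}
selected = [name for name, kappas in scores.items() if keep_feature(kappas)]
print(selected)
```

Requiring a vote from the last step favours features that still carry signal when the training and test data come from different datasets.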

2.2- Feature Selection with GA

02.2 GA_feature_selection.ipynb: In this step, feature selection is performed using examples of the same attack from different datasets. A model is trained on the first dataset using the feature set generated by the genetic algorithm. This model is then tested on a second dataset, and the test result is given to the genetic algorithm as feedback. This process is repeated 25 times. The process is visualised in Fig. 5.

Fig. 5 - Feature selection using the genetic algorithm with external feedback.
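The idea of external feedback is that the fitness of a candidate feature set is measured on data the model was not trained on. The following is a minimal sketch of that loop, not the code in 02.2 GA_feature_selection.ipynb. The synthetic data, the decision tree, the F1 score as fitness, the population size, and the crossover and mutation operators are all assumptions, and the 25 repetitions are interpreted here as 25 generations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for two datasets that contain the same attack.
X, y = make_classification(n_samples=600, n_features=12, n_informative=4, random_state=0)
X_train, y_train = X[:300], y[:300]        # "first dataset": used for training
X_feedback, y_feedback = X[300:], y[300:]  # "second dataset": used only for feedback

def fitness(mask):
    """Train on the first dataset with the masked features; score on the second."""
    if not mask.any():
        return 0.0
    model = DecisionTreeClassifier(random_state=0).fit(X_train[:, mask], y_train)
    return f1_score(y_feedback, model.predict(X_feedback[:, mask]))

def evolve(pop_size=10, generations=25, mutation_rate=0.1):
    """Evolve boolean feature masks; return the best mask seen and its score."""
    n_features = X.shape[1]
    population = rng.random((pop_size, n_features)) < 0.5
    best_mask, best_score = None, -1.0
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in population])
        if scores.max() > best_score:
            best_score = float(scores.max())
            best_mask = population[scores.argmax()].copy()
        # Selection: keep the better half of the population as parents.
        parents = population[np.argsort(scores)[::-1][: pop_size // 2]]
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)               # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < mutation_rate   # bit-flip mutation
            children.append(child ^ flip)
        population = np.array(children)
    return best_mask, best_score

best_mask, best_score = evolve()
print("selected features:", np.flatnonzero(best_mask), "score:", round(best_score, 3))
```

Because the fitness is computed on the second dataset, feature sets that only work on the training data are penalised during the search.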

3- Algorithm Selection

03.1 Hyperparameter Optimization: In this file, hyperparameter optimization is applied to the machine learning models using scikit-learn's randomized search (RandomizedSearchCV). The models are:

  • Logistic Regression (LR)
  • Decision Tree (DT)
  • Naive Bayes (NB)
  • Support Vector Machine (SVM)
  • Random Forest (RF)
  • Extreme Gradient Boosting (XGBoost)
  • K-Nearest Neighbors (KNN)
  • Multilayer Perceptron (MLP)
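A randomized search samples a fixed number of parameter combinations and scores each with cross-validation. The sketch below shows this for a Random Forest on synthetic data. The search space, the number of iterations, the number of folds and the F1 scoring are assumptions for illustration and may differ from the settings used in the repository.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for a labelled feature table.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Assumed search space; each entry is a distribution to sample from.
param_distributions = {
    "n_estimators": randint(10, 100),
    "max_depth": randint(2, 12),
    "min_samples_split": randint(2, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # number of random parameter combinations to try
    cv=3,             # 3-fold cross-validation for each combination
    scoring="f1",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other models in the list by swapping the estimator and its parameter distributions.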

4- Performance Evaluation

03.2 ML-MAIN-group.ipynb: In this step, a 3-stage evaluation process is performed. The test data used in stages 2 and 3 are isolated datasets that have not been used in any previous step. The three stages are:

  • Using a cross-validated session as training and test.
  • Using one of the two sessions as testing and the other as training data.
  • Using data from two different datasets as training and testing.
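The three stages differ only in where the test data come from. The sketch below illustrates them with scikit-learn on synthetic data; it is not the code in 03.2 ML-MAIN-group.ipynb. The decision tree, the F1 metric, and the way sessions and datasets are simulated are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Dataset A is split into two "sessions"; dataset B simulates a different dataset.
X_a, y_a = make_classification(n_samples=600, n_features=8, random_state=0)
X_s1, y_s1 = X_a[:300], y_a[:300]   # session 1
X_s2, y_s2 = X_a[300:], y_a[300:]   # session 2
X_b, y_b = make_classification(n_samples=300, n_features=8, random_state=1)

model = DecisionTreeClassifier(random_state=0)

# Stage 1: cross-validation within a single session.
cv_f1 = cross_val_score(model, X_s1, y_s1, cv=5, scoring="f1").mean()

# Stage 2: train on one session, test on the other (isolated) session.
session_f1 = f1_score(y_s2, model.fit(X_s1, y_s1).predict(X_s2))

# Stage 3: train on one dataset, test on a different (isolated) dataset.
dataset_f1 = f1_score(y_b, model.fit(X_a, y_a).predict(X_b))

print(f"cross-validation: {cv_f1:.3f}")
print(f"session to session: {session_f1:.3f}")
print(f"dataset to dataset: {dataset_f1:.3f}")
```

Since the synthetic dataset B is generated independently of dataset A, the third score here says nothing about real attacks; it only shows how the split is set up.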

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citations

If you use the source code, please cite the following paper:

Kahraman Kostas, Mike Just, and Michael A. Lones. IoTGeM: Generalisable Models for Behaviour-Based IoT Attack Detection, arXiv preprint, arXiv:2401.01343, 2023.

@misc{kostas2023IoTGeM,
      title={{IoTGeM}: Generalisable Models for Behaviour-Based {IoT} Attack Detection}, 
      author={Kahraman Kostas and Mike Just and Michael A. Lones},
      year={2023},
      eprint={2401.01343},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

Contact: Kahraman Kostas kahramankostas@gmail.com