In this repository you will find a Python implementation of the methods in the paper IoTDevID: A Behavior-Based Device Identification Method for the IoT.
Device identification is one way to secure a network of IoT devices, whereby devices identified as suspicious can subsequently be isolated from a network. In this study, we present a machine learning-based method, IoTDevID, that recognises devices through characteristics of their network packets. As a result of using a rigorous feature analysis and selection process, our study offers a generalizable and realistic approach to modelling device behavior, achieving high predictive accuracy across two public datasets. The model's underlying feature set is shown to be more predictive than existing feature sets used for device identification, and is shown to generalise to data unseen during the feature selection process. Unlike most existing approaches to IoT device identification, IoTDevID is able to detect devices using non-IP and low-energy protocols.
The application files were created with Wireshark and Python 3.6. Before running them, make sure that Wireshark, Python 3.6+ and the following libraries are installed.
Library | Task |
---|---|
Scapy | Packet (pcap) crafting |
tshark | Packet (pcap) crafting |
Sklearn | Machine Learning & Data Preparation |
xverse | Feature importance/voting |
Numpy | Mathematical Operations |
Pandas | Data Analysis |
Matplotlib | Plotting and visualisation |
Seaborn | Plotting and visualisation |
graphviz | Plotting and visualisation |
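A quick way to see what is still missing before running the notebooks is sketched below. It is only an illustration: the import names are assumed from the table above (scikit-learn is imported as `sklearn`), and `tshark` is looked up as a command-line tool on the `PATH` rather than as a Python package.

```python
import importlib.util
import shutil

# Import names assumed from the library table (not taken from the repository code).
REQUIRED_MODULES = ["scapy", "sklearn", "xverse", "numpy", "pandas",
                    "matplotlib", "seaborn", "graphviz"]
REQUIRED_TOOLS = ["tshark"]  # command-line tool installed alongside Wireshark


def find_missing(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return (missing_modules, missing_tools) without importing anything."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    mods, tools = find_missing()
    print("Missing Python packages:", mods or "none")
    print("Missing command-line tools:", tools or "none")
```

An empty result for both lists suggests the environment is ready; it does not check library versions.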
The technical specifications of the computer used for experiments are given below.
Component | Specification |
---|---|
Central Processing Unit | Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz |
Random Access Memory | 8 GB (7.74 GB usable) |
Operating System | Windows 10 Pro 64-bit |
Graphics Processing Unit | AMD Radeon (TM) 530 |
The implementation phase consists of 5 steps, which are:
- Feature Extraction
- Feature Selection
- Algorithm Selection
- Performance Evaluation
- Comparison with Previous Work
Each of these steps is implemented using one or more Python files. Every file is provided with both "py" and "ipynb" extensions, and the code in the two versions is exactly the same. The ipynb version has the advantage of storing the state and screen output of its last run, so the output can be viewed without re-running the file. Files with the ipynb extension can be run using Jupyter Notebook.
Four files are relevant to the feature extraction step:
- 01.1 Aalto feature extraction IoTDevID
- 01.2 Aalto feature extraction IoTSense - IoT Sentinel
- 01.3 UNSW feature extraction IoTDevID
- 01.4 UNSW feature extraction IoTSense - IoT Sentinel
These files convert pcap files into single-packet-based fingerprint files in CSV format (the individual-packet feature sets of IoT Sentinel, IoTSense and IoTDevID) and create the labels.
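The labelling and CSV-writing part of this step can be pictured with the minimal sketch below. It is not the repository's extraction code: it assumes that packets have already been parsed into per-packet records, that labels are assigned from the source MAC address, and the field names (`src_mac`, `pck_size`, `protocol`), the `Label` column and the MAC-to-device mapping are all hypothetical.

```python
import csv
import io

# Hypothetical MAC-to-device mapping; the real one depends on the dataset.
MAC_TO_DEVICE = {
    "aa:bb:cc:00:00:01": "camera",
    "aa:bb:cc:00:00:02": "smart_plug",
}


def label_packets(packets, mac_to_device, unknown="unknown"):
    """Return one fingerprint row per packet, with a 'Label' column added."""
    rows = []
    for pkt in packets:
        row = dict(pkt)  # copy so the input records are left untouched
        row["Label"] = mac_to_device.get(pkt["src_mac"].lower(), unknown)
        rows.append(row)
    return rows


def write_fingerprints(rows, handle):
    """Write the labelled rows as CSV to an open text file handle."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Toy per-packet records standing in for fields parsed from a pcap file.
packets = [
    {"src_mac": "AA:BB:CC:00:00:01", "pck_size": 60, "protocol": "TCP"},
    {"src_mac": "aa:bb:cc:00:00:02", "pck_size": 342, "protocol": "UDP"},
    {"src_mac": "aa:bb:cc:00:00:09", "pck_size": 42, "protocol": "ARP"},
]
rows = label_packets(packets, MAC_TO_DEVICE)
buffer = io.StringIO()
write_fingerprints(rows, buffer)
print(buffer.getvalue())
```

Each packet becomes one CSV row, which is the "single packet-based" shape described above; packets from unmapped MAC addresses are kept and marked `unknown` in this sketch.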
The processed datasets are shared in the repository. However, raw versions of the datasets used in the study and their addresses are given below.
Dataset | Capture Year | Number of Devices | Type |
---|---|---|---|
Aalto University | 2016 | 31 | Benign |
UNSW-Sydney IEEE TMC | 2016 | 31 | Benign |
UNSW-Sydney ACM SOSR | 2018 | 28 | Benign & Malicious |
Since the UNSW data are very large, we filtered them on a device and session basis. You can access the pcap files obtained from this filtering process from this link (Used Pcap Files).
In addition, the CSVs.zip file contains the feature sets that are the output of this step and that we used in our experiments. These files are:
- Aalto_test_IoTDevID.csv
- Aalto_train_IoTDevID.csv
- Aalto_IoTSense_Test.csv
- Aalto_IoTSense_Train.csv
- Aalto_IoTSentinel_Test.csv
- Aalto_IoTSentinel_Train.csv
- UNSW_test_IoTDevID.csv
- UNSW_train_IoTDevID.csv
- UNSW_IoTSense_Test.csv
- UNSW_IoTSense_Train.csv
- UNSW_IoTSentinel_Test.csv
- UNSW_IoTSentinel_Train.csv
Three files are relevant to the feature selection step.
- 02.1 Feature importance voting and pre-assessment of features: This file calculates an importance score for each feature using six scoring methods. It then votes for features using these scores, and lists and plots the scores and votes that each feature received. The six feature importance scoring methods are:
  - Information Value using Weight of Evidence.
  - Variable Importance using Random Forest.
  - Recursive Feature Elimination.
  - Variable Importance using Extra Trees classifier.
  - Chi-Square best variables.
  - L1-based feature selection.
- 02.2 Comparison of isolated data and CV methods: This file compares the results obtained with isolated training and test data against those obtained with cross-validation.
- 02.3 Feature selection process using genetic algorithm: This file performs feature selection using a genetic algorithm.
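The voting idea in 02.1 can be illustrated with the short sketch below. It makes two assumptions that are not stated in this README: each method casts one vote for each of its `top_k` highest-scoring features, and the feature names and scores are invented toy values. The notebook itself relies on the libraries listed earlier, so treat this only as a picture of the mechanism.

```python
def vote_for_features(scores_by_method, top_k):
    """Give one vote to each method's top_k highest-scoring features."""
    votes = {}
    for scores in scores_by_method.values():
        for feature in scores:
            votes.setdefault(feature, 0)  # make sure zero-vote features are listed
        ranked = sorted(scores, key=scores.get, reverse=True)
        for feature in ranked[:top_k]:
            votes[feature] += 1
    # Most-voted features first; ties broken alphabetically for determinism.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy scores from three of the six methods (hypothetical features and values).
scores_by_method = {
    "random_forest": {"pck_size": 0.50, "ttl": 0.30, "dst_port": 0.15, "ip_flags": 0.05},
    "extra_trees": {"pck_size": 0.40, "ttl": 0.10, "dst_port": 0.35, "ip_flags": 0.15},
    "chi_square": {"pck_size": 120.0, "ttl": 80.0, "dst_port": 5.0, "ip_flags": 60.0},
}

ranking = vote_for_features(scores_by_method, top_k=2)
print(ranking)
# [('pck_size', 3), ('ttl', 2), ('dst_port', 1), ('ip_flags', 0)]
```

Because each method only contributes a rank, scores on different scales (for example chi-square statistics and tree importances) can be combined without normalisation.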
Two files are relevant to the algorithm selection step.
- 03.1 Hyperparameter Optimization: This file applies hyperparameter optimization, using scikit-learn's randomized search (RandomizedSearchCV), to the machine learning models being used. These models are:
  - Decision Trees (DT)
  - Naïve Bayes (NB)
  - Gradient Boosting (GB)
  - k-Nearest Neighbours (kNN)
  - Random Forest (RF)
  - Support Vector Machine (SVM)
- 03.2 Classification of Individual packets for Aalto Dataset: This file trains the machine learning models listed above on the individual packets of the Aalto University dataset, using the optimised hyperparameters.
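The principle behind the randomized search in 03.1 is sketched below without any dependencies. This is not scikit-learn's implementation and not the notebook's code: the search space is a hypothetical one for a decision tree, and `toy_score` stands in for the cross-validated score that a real search would compute by training a model.

```python
import random


def random_search(evaluate, param_space, n_iter, seed=0):
    """Try n_iter random parameter combinations and return the best one."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Sample one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a decision tree.
param_space = {
    "max_depth": [2, 4, 8, 16, 32],
    "min_samples_split": [2, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}


def toy_score(params):
    """Stand-in for a cross-validated score; highest at max_depth=8, min_samples_split=5."""
    return -abs(params["max_depth"] - 8) - abs(params["min_samples_split"] - 5)


best_params, best_score = random_search(toy_score, param_space, n_iter=50)
print(best_params, best_score)
```

Unlike an exhaustive grid search, the number of evaluations is fixed by `n_iter` regardless of how large the search space is, which keeps the cost predictable when six different models have to be tuned.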
Four files are relevant to the performance evaluation step. In the algorithm selection experiments above, we found that DT offers the best balance between predictive performance and inference time among the machine learning methods tested. Therefore, only DT is used in all subsequent experiments.
- 04.1 Determination of aagregetion size: This file tests different aggregation sizes. Groups of different sizes (from 2 to 25) are formed and the performance obtained with each group size is observed.
- 04.2 Classification of ind-aag-mixed packets for Aalto Dataset: This file obtains results for the Aalto dataset using the individual, aggregated and mixed methods. A group size of 13 is used in the aggregation operations.
- 04.3 Classification of ind-aag-mixed packets for UNSW Dataset: This file obtains results for the UNSW dataset using the individual, aggregated and mixed methods. A group size of 13 is used in the aggregation operations.
- 04.4 Aalto results with combined labels: The Aalto dataset contains many very similar devices, which lowers performance. To deal with this, this file treats such similar devices as a group and collects them under the same label.
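One way to picture the aggregation used in these files is the sketch below. It rests on assumptions that this README does not spell out: packets are grouped per MAC address in arrival order, each group holds at most `group_size` packets, and every packet in a group receives the group's most frequent individual prediction. The MAC addresses and labels are toy values.

```python
from collections import Counter, defaultdict


def aggregate_predictions(macs, predictions, group_size=13):
    """Replace each packet's prediction with the majority label of its group.

    Packets are grouped per MAC address, in arrival order, into chunks of
    at most group_size packets.
    """
    indices_by_mac = defaultdict(list)
    for index, mac in enumerate(macs):
        indices_by_mac[mac].append(index)

    aggregated = list(predictions)
    for indices in indices_by_mac.values():
        for start in range(0, len(indices), group_size):
            group = indices[start:start + group_size]
            majority = Counter(predictions[i] for i in group).most_common(1)[0][0]
            for i in group:
                aggregated[i] = majority
    return aggregated


# Toy example with a small group size so the effect is easy to follow.
macs = ["m1", "m1", "m2", "m1", "m2", "m2", "m1"]
preds = ["cam", "plug", "hub", "cam", "hub", "cam", "plug"]
print(aggregate_predictions(macs, preds, group_size=3))
# ['cam', 'cam', 'hub', 'cam', 'hub', 'hub', 'plug']
```

In the toy run, the isolated misclassifications at positions 1 and 5 are outvoted by their groups, while the last packet of `m1` forms an incomplete group of one and keeps its individual prediction.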
Two files are relevant to the comparison with previous work.
- 05.1 Aalto IoTSense & IoTSentinel Normal, Aagregeted, Mixed Results: This file trains machine learning models using Aalto University data for three studies (IoTDevID, IoTSense, IoT Sentinel) with the individual, aggregated and mixed approaches in order to compare the performance of the feature sets.
- 05.2 UNSW IoTSense & IoTSentinel Normal, Aagregeted, Mixed Results: This file trains machine learning models using UNSW data for three studies (IoTDevID, IoTSense, IoT Sentinel) with the individual, aggregated and mixed approaches in order to compare the performance of the feature sets.
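A comparison of this kind comes down to scoring the predictions obtained with each feature set against the same ground truth. The sketch below does this with accuracy and macro-averaged F1; the choice of these two metrics is an assumption, and the labels and predictions are toy values with placeholder feature set names, not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of packets whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy ground truth and predictions for two placeholder feature sets.
y_true = ["a", "a", "b", "b", "c", "c"]
predictions = {
    "feature_set_A": ["a", "a", "b", "b", "c", "c"],
    "feature_set_B": ["a", "b", "b", "b", "c", "a"],
}

for name, y_pred in predictions.items():
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.3f}, "
          f"macro F1={macro_f1(y_true, y_pred):.3f}")
```

Macro averaging gives every device class the same weight, which matters when some devices produce far more packets than others.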
This project is licensed under the MIT License. See the LICENSE file for details.
If you use the source code, please cite the following paper:
@misc{kostas2021iotdevid2,
title={{IoTDevID}: A Behavior-Based Device Identification Method for the {IoT}},
author={Kahraman Kostas and Mike Just and Michael A. Lones},
year={2021},
eprint={2102.08866v2},
archivePrefix={arXiv},
primaryClass={cs.CR}
}
Contact: Kahraman Kostas kahramankostas@gmail.com