IoTDevID-CIC

Application of IoTDevID to CIC-IoT-2022 dataset


Externally validating the IoTDevID device identification methodology

With some lessons learned for modelling IoT devices

Overview

This repository contains a Python implementation of the methods in the paper "Applying IoTDevID to a New Dataset: the CIC IoT Dataset 2022 Case Study".

Summary

In the era of rapid IoT device proliferation, recognizing, diagnosing, and securing these devices are crucial tasks. The IoTDevID method (IEEE Internet of Things ’22) proposes a machine learning approach for device identification using network packet features. In this article we present a validation study of the IoTDevID method by testing core components, namely its feature set and its aggregation algorithm, on a new dataset. The new dataset (CIC IoT Dataset 2022) offers several advantages over earlier datasets, including a larger number of devices, multiple instances of the same device, both IP and non-IP device data, normal (benign) usage data, and diverse usage profiles, such as active and idle states. Using this independent dataset, we explore the validity of IoTDevID’s core components, and also examine the impacts of the new data on model performance. Our results indicate that data diversity is important to model performance. For example, models trained with active usage data outperformed those trained with idle usage data, and multiple usage data similarly improved performance. Results for IoTDevID were strong with a 92.50 F1 score for 31 IP-only device classes, similar to our results on previous datasets. In all cases, the IoTDevID aggregation algorithm improved model performance. For non-IP devices we obtained a 78.80 F1 score for 40 device classes, though with much less data, confirming that data quantity is also important to model performance.


Fig 1 - A brief overview of the IoTDevID methodology.

Requirements and Infrastructure:

The application files were created with Wireshark and Python 3.10. Before running them, make sure that Wireshark, Python 3.10+ and the following libraries are installed.

Library      Task
Scapy        Packet (pcap) crafting
tshark       Packet (pcap) crafting
Sklearn      Machine learning and data preparation
Numpy        Mathematical operations
Pandas       Data analysis
Scipy        Data analysis, mathematical operations
Matplotlib   Graphics and visualization
Seaborn      Graphics and visualization
tabulate     Pretty-print tabular data output
tqdm         Progress meter
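
Before opening the notebooks, it can help to confirm that the requirements above are present. The following is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and the import names are assumptions (for example, scikit-learn is imported as sklearn). tshark is looked up as an executable on the PATH rather than as a Python module.

```python
import importlib.util
import shutil
import sys

# Import names are assumptions; scikit-learn, for instance, is imported as "sklearn".
REQUIRED_MODULES = ["scapy", "sklearn", "numpy", "pandas", "scipy",
                    "matplotlib", "seaborn", "tabulate", "tqdm"]


def check_environment(modules=REQUIRED_MODULES, tools=("tshark",)):
    """Return a dict mapping each requirement to True (found) or False."""
    status = {"python>=3.10": sys.version_info >= (3, 10)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        status[tool] = shutil.which(tool) is not None
    return status


for requirement, found in check_environment().items():
    print(f"{requirement:15s} {'ok' if found else 'MISSING'}")
```

Any line reported as MISSING points to a package or tool that should be installed before running the files.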

The technical specifications of the computer used for experiments are given below.

Central Processing Unit : 12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz
Random Access Memory : 16.0 GB (15.7 GB usable)
Operating System : Windows 11 Home

Data:

Features were extracted from the CIC IoT Dataset 2022, and the resulting feature sets were used at different stages of the study, as indicated in the following table.

Data                        Description
PCAP Files                  Raw network data, input of feature extraction - used in Section 3
All Sessions [54 CSV]       Output of feature extraction - used in Section 4.1
AA, AI, IA, II              Merged sessions
AA, AI, IA, II 10% sample   Size-reduced merged sessions - used in Sections 4.2/4.3
AA+non-IP Devices           Size-reduced AA with non-IP/Zigbee data - used in Section 4.4
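
The table does not say how the 10% samples were drawn. As an illustration only, the sketch below assumes a random sample taken separately within each device label, so that every device keeps roughly the same share of rows. The function name sample_per_label and the column name Label are hypothetical.

```python
import pandas as pd


def sample_per_label(df, label_col="Label", frac=0.10, seed=0):
    """Draw the same fraction of rows from every label group (assumed strategy)."""
    return df.groupby(label_col).sample(frac=frac, random_state=seed)


# Toy data: 100 packets from one device and 50 from another.
toy = pd.DataFrame({
    "Label": ["Camera"] * 100 + ["Plug"] * 50,
    "pck_size": range(150),
})
reduced = sample_per_label(toy)
print(reduced["Label"].value_counts())
```

Fixing the seed keeps the reduced files reproducible across runs.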

Implementation:

The code is presented as Jupyter notebooks (ipynb files). A notebook stores the state and screen output of its last run, so the output can be inspected without re-running the file. The notebooks can be run using Jupyter Notebook.

Feature Extraction (PCAP2CSV)

Section 3.2 in the article

  • 01.0 - Features_Extraction: This file converts the pcap files into packet-based fingerprint files in CSV format and creates the labels.

  • 01.1 - Unknown-MAC-cleaning: This file removes fingerprints whose MAC addresses are not among the known devices. These fingerprints cannot be labelled because their MAC addresses are unknown.

  • 01.2 - Creating_smaller_DF_with_Selected_features: Feature extraction creates about 100 features, most of which are not used. This file reduces the file size by removing the unused features.

  • 01.3 - Creating Session_ID.ipynb: This file assigns an identification number to each session to indicate which sessions contain the same devices. It also collects devices of the same brand and model under one label, for example: Teckin Plug 1 / Teckin Plug 2 --> Teckin Plug
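
The cleaning and labelling steps listed above can be sketched with pandas as follows. This is a simplified illustration, not the notebooks' code: the MAC addresses, the column names (MAC, Label, Session_ID, pck_size) and the function label_and_clean are all hypothetical.

```python
import pandas as pd

# Hypothetical MAC-to-device mapping; the real addresses come from the dataset.
KNOWN_DEVICES = {
    "aa:bb:cc:00:00:01": "Teckin Plug 1",
    "aa:bb:cc:00:00:02": "Teckin Plug 2",
}
# Instances of the same brand and model are collected under one label.
MERGED_LABELS = {"Teckin Plug 1": "Teckin Plug", "Teckin Plug 2": "Teckin Plug"}


def label_and_clean(df, session_id, keep_cols, mac_col="MAC"):
    """Drop unknown MACs, label the rest, keep selected features, tag the session."""
    known = df[df[mac_col].isin(list(KNOWN_DEVICES))].copy()
    known["Label"] = known[mac_col].map(KNOWN_DEVICES).replace(MERGED_LABELS)
    known["Session_ID"] = session_id
    return known[list(keep_cols) + [mac_col, "Label", "Session_ID"]]


packets = pd.DataFrame({
    "MAC": ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "ff:ff:ff:ff:ff:ff"],
    "pck_size": [60, 70, 80],
    "unused_feature": [1, 2, 3],
})
cleaned = label_and_clean(packets, session_id=3, keep_cols=["pck_size"])
print(cleaned)
```

In this toy run the packet from the unknown MAC address is dropped, the unused column is removed, and both plugs end up with the label Teckin Plug.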

Performance Evaluation

Section 4.1 in the article

Section 4.2/4.3 in the article

Section 4.4 in the article

  • 04.0 - Preprocessing other data: Non-IP devices are filtered from the Power and Interactions sessions and added to the Active training and testing datasets.
  • 04.1 - General evaluation with other data: This file obtains results for the Idle and Active datasets, including non-IP devices, using both the individual and the aggregated methods. A group size of 13 was used in the aggregation operations.
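
To illustrate what aggregation with a group size of 13 means, the sketch below assumes that the per-packet predictions of each MAC address are split into consecutive groups and that every packet in a group receives the group's most frequent label. It is a simplified illustration and omits any special handling the original algorithm may apply; the function name aggregate_predictions and its parameters are hypothetical.

```python
from collections import Counter


def aggregate_predictions(macs, predictions, group_size=13):
    """Replace each prediction with the majority label of its per-MAC group."""
    # Collect the positions of each MAC address's packets, in arrival order.
    positions = {}
    for i, mac in enumerate(macs):
        positions.setdefault(mac, []).append(i)

    aggregated = list(predictions)
    for idxs in positions.values():
        # Walk through this MAC's packets in consecutive chunks of group_size.
        for start in range(0, len(idxs), group_size):
            group = idxs[start:start + group_size]
            winner = Counter(predictions[i] for i in group).most_common(1)[0][0]
            for i in group:
                aggregated[i] = winner
    return aggregated


# Toy example with a small group size so that the effect is easy to see.
macs = ["a"] * 5 + ["b"] * 3
preds = ["cam", "cam", "plug", "cam", "plug", "hub", "plug", "hub"]
smoothed = aggregate_predictions(macs, preds, group_size=5)
print(smoothed)
```

Isolated misclassifications inside a group are overwritten by the majority label, which is one way such a step can raise packet-level scores.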

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citations

If you use the source code, please cite the following paper:

@misc{kostas2023CIC,
      title={Externally validating the {IoTDevID} device identification methodology}, 
      author={Kahraman Kostas and Mike Just and Michael A. Lones},
      year={2023},
      eprint={2307.08679},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

Contact: Kahraman Kostas kahramankostas@gmail.com