The HAI dataset was collected from a realistic industrial control system (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.
Click here to find out more about the HAI dataset.
Please e-mail us here if you have any questions about the dataset.
- Background
- HAI Testbed
- HAI Dataset
- Getting the Dataset
- Performance Evaluation
- Projects using the Dataset
- Change Log
- Authors
- References
- In 2017, three laboratory-scale CPS testbeds were launched: GE’s turbine testbed, Emerson’s boiler testbed, and FESTO’s modular production system (MPS) water-treatment testbed. These testbeds ran relatively simple processes and were operated independently of each other.
- In 2018, a complex process system was built to combine the three systems using a hardware-in-the-loop (HIL) simulator, in which thermal power generation and pumped-storage hydropower generation were simulated. This ensured that the variables were highly coupled and correlated, yielding a richer dataset. In addition, an Open Platform Communications Unified Architecture (OPC UA) gateway was installed to facilitate data collection from heterogeneous devices.
- The first version of the HAI dataset, HAI 1.0, was made available on GitHub and Kaggle in February 2020. This dataset included ICS operational data from both normal and anomalous situations for 38 attacks. Subsequently, a debugged version of HAI 1.0, namely HAI 20.07, was released for the HAICon 2020 competition in August 2020.
- HAI 21.03 was released in 2021. It is based on a more tightly coupled HIL simulator that produces clearer attack effects, and it includes additional attacks. It provides more quantitative information, covers a variety of operational situations, and gives better insight into the dynamic changes of the physical system.
The testbed consists of four processes: boiler, turbine, water treatment, and HIL simulation:
- Boiler Process (P1): A water-to-water heat-transfer process with low pressure and moderate temperature. It is controlled by Emerson's Ovation DCS.
- Turbine Process (P2): A rotor kit process that closely simulates the behavior of an actual rotating machine. It is controlled by GE's Mark VIe DCS.
- Water-treatment Process (P3): A water-treatment process that includes pumping water to the upper reservoir and releasing it back into the lower reservoir. It is controlled by Siemens's S7-300 PLC.
- HIL Simulation (P4): The boiler and turbine processes are interconnected so that they remain synchronous with the rotating speed of the virtual steam-turbine power generation model. The pump and valve in the water-treatment process are controlled by the pumped-storage hydropower generation model. dSPACE's SCALEXIO system is used for the HIL simulations and is interconnected with the real-world processes through a Siemens S7-1500 PLC and ET200 remote I/O devices, with data acquisition based on the OPC gateway.
Two major versions of the HAI dataset have been released thus far. Each dataset consists of several CSV files, and each file satisfies time continuity. The quantitative summary of each version is as follows:
Note: The version numbering follows a date-based scheme, where the version number indicates the release date of the HAI dataset. HAI 20.07 is the bug-fixed release of the first version, HAI 1.0, released in February 2020.
Version | Data Points | Normal Files | Normal Interval | Normal Size | Attack Files | Attack Count | Attack Interval | Attack Size |
---|---|---|---|---|---|---|---|---|
HAI 21.03 | 78 points/sec | train1.csv | 60 hours | 100 MB | test1.csv | 5 attacks | 12 hours | 22 MB |
HAI 21.03 | 78 points/sec | train2.csv | 63 hours | 116 MB | test2.csv | 20 attacks | 33 hours | 62 MB |
HAI 21.03 | 78 points/sec | train3.csv | 229 hours | 246 MB | test3.csv | 8 attacks | 30 hours | 56 MB |
HAI 21.03 | 78 points/sec | | | | test4.csv | 5 attacks | 11 hours | 20 MB |
HAI 21.03 | 78 points/sec | | | | test5.csv | 12 attacks | 26 hours | 48 MB |
HAI 20.07 (HAI 1.0) | 59 points/sec | train1.csv | 86 hours | 127 MB | test1.csv | 28 attacks | 81 hours | 119 MB |
HAI 20.07 (HAI 1.0) | 59 points/sec | train2.csv | 91 hours | 98 MB | test2.csv | 10 attacks | 42 hours | 62 MB |
The time-series data in each CSV file satisfies time continuity. The first column gives the observation time as “yyyy-MM-dd hh:mm:ss”, and the remaining columns provide the recorded SCADA data points. The last four columns are labels indicating whether an attack occurred: the attack column applies to all processes, and the other three columns apply to the corresponding control processes.
Refer to the latest technical manual for the details of each column.
time | P1_B2004 | P2_B2016 | ... | P4_HT_LD | attack | attack_P1 | ... | attack_P3 |
---|---|---|---|---|---|---|---|---|
20190926 13:00:00 | 0.09830 | 1.07370 | ... | 0 | 0 | 0 | ... | 0 |
20190926 13:00:01 | 0.09830 | 1.07410 | ... | 0 | 1 | 0 | ... | 1 |
20190926 13:00:02 | 0.09830 | 1.07380 | ... | 0 | 1 | 0 | ... | 1 |
20190926 13:00:03 | 0.09830 | 1.07360 | ... | 0 | 1 | 1 | ... | 1 |
20190926 13:00:04 | 0.09830 | 1.07430 | ... | 0 | 1 | 1 | ... | 1 |
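The column layout described above can be read with a few lines of Python. The following is a minimal sketch using only the standard library, not an official loader: the function names `open_text` and `read_hai` are hypothetical, the delimiter is assumed to be a comma (pass another one if your files differ), and label columns are assumed to be exactly those whose names start with "attack", as in the sample.

```python
import csv
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed CSV file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_hai(stream, delimiter=","):
    """Split a HAI CSV stream into timestamps, data points, and labels.

    Assumptions (not confirmed by the dataset authors): the first column
    is the timestamp, and every label column name starts with "attack".
    """
    reader = csv.reader(stream, delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    label_idx = [i for i, name in enumerate(header) if name.startswith("attack")]
    point_idx = [i for i in range(1, len(header)) if i not in label_idx]

    times, points, labels = [], [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        times.append(row[0].strip())
        points.append([float(row[i]) for i in point_idx])
        labels.append([int(float(row[i])) for i in label_idx])

    columns = {
        "points": [header[i] for i in point_idx],
        "labels": [header[i] for i in label_idx],
    }
    return times, points, labels, columns
```

For a file on disk, something like `with open_text("train1.csv.gz") as f: times, points, labels, columns = read_hai(f)` should work, where the file name is only an example. Timestamps are kept as strings so that the sketch does not depend on a particular time format.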
NOTICE: All data files are compressed with standard GNU zip (gzip) because of a strict maximum size limit of 100 MB for individual files in a repository.
Run git clone with the URL below:
$ git clone https://github.com/icsdataset/hai
To unzip multiple gzip files, you can use:
$ gunzip *.gz
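If gunzip is not available (for example on Windows), the same step can be done from Python. This is a minimal sketch with the standard library; `gunzip_all` is a hypothetical helper name, and unlike gunzip it keeps the compressed originals unless told otherwise.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(directory, keep_original=True):
    """Decompress every *.gz file in `directory` next to its source.

    Returns the list of decompressed paths. By default the .gz files
    are kept, which differs from the default behavior of gunzip.
    """
    written = []
    for src in sorted(Path(directory).glob("*.gz")):
        dst = src.with_suffix("")  # "train1.csv.gz" -> "train1.csv"
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream in chunks, low memory use
        if not keep_original:
            src.unlink()
        written.append(dst)
    return written
```

Calling `gunzip_all("path/to/data")` on the folder that holds the downloaded files decompresses them in place; the path is a placeholder for wherever the repository was cloned.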
It is strongly recommended to evaluate your anomaly detection algorithm with the TaPR (Time-series Aware Precision and Recall) method, which makes performance comparisons with other studies fair. Got something to suggest? Let us know!
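For a quick sanity check before running TaPR, plain point-wise precision and recall can be computed from the attack column and a detector's 0/1 output. The sketch below is explicitly not TaPR and does not reproduce its time-series-aware scoring; it is only a simple baseline, and the function name `pointwise_precision_recall` is hypothetical.

```python
def pointwise_precision_recall(labels, predictions):
    """Point-wise precision and recall for 0/1 sequences of equal length.

    NOTE: this is a plain per-sample baseline, not the TaPR method.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # attack, detected
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # normal, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # attack, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Point-wise scores treat every second independently, so they can differ considerably from TaPR results on the same predictions; use them only as a rough check.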
Here are some projects and experiments that are using or featuring the dataset in interesting ways. Got something to add? Let us know!
- HAICon 2020 : https://dacon.io/competitions/official/235624/overview/description
- HAICon 2021 : https://dacon.io/en/competitions/official/235757/overview/description
Please refer to the technical manual for the detailed changes.
- HAI 21.03 release (2021-03-25)
- HAI 20.07 release (2020-07-22)
- Initial release (2020-02-07)
Created by Hyeok-Ki Shin, Woomyo Lee, Jeong-Han Yun and HyoungChun Kim in the Affiliated Institute of ETRI, Daejeon, South Korea.
This work is licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).
- Hyeok-Ki Shin, Woomyo Lee, Jeong-Han Yun, and HyoungChun Kim, "HAI 1.0: HIL-based Augmented ICS Security Dataset", 13th USENIX Workshop on Cyber Security Experimentation and Test (CSET 20), Santa Clara, CA, 2020.
- Won-Seok Hwang, Jeong-Han Yun, Jonguk Kim, and HyoungChun Kim, "Time-Series Aware Precision and Recall for Anomaly Detection: Considering Variety of Detection Result and Addressing Ambiguous Labeling", CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2241-2244, 2019.
- Seungoh Choi, Jeong-Han Yun, Sin-Kyu Kim, "A Comparison of ICS Datasets for Security Research Based on Attack Paths", In: Luiijf E., Žutautaitė I., Hämmerli B. (eds) Critical Information Infrastructures Security. CRITIS 2018. Lecture Notes in Computer Science, vol 11260. Springer, Cham.
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
property | value |
---|---|
name | HIL-based Augmented ICS Security Dataset |
alternateName | HAI Security Dataset |
alternateName | hai security dataset |
url | https://github.com/icsdataset/hai |
sameAs | https://github.com/icsdataset/hai |
description | The HAI security dataset was collected from a realistic Industrial Control System (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation. |
provider | |
license | |
citation | https://www.usenix.org/conference/cset19/presentation/shin |
citation | https://dl.acm.org/doi/10.1145/3357384.3358118 |