This repository is part of IPAL - an Industrial Protocol Abstraction Layer. IPAL aims to establish an abstract representation of industrial network traffic for subsequent unified and protocol-independent industrial intrusion detection. IPAL consists of a transcriber to automatically translate industrial traffic into the IPAL representation, an IDS Framework implementing various industrial intrusion detection systems (IIDSs), and a collection of evaluation datasets. For details about IPAL, please refer to our publications listed down below.
Industrial systems are increasingly threatened by cyber attacks with potentially disastrous consequences. To counter such attacks, industrial intrusion detection systems strive to uncover even the most sophisticated breaches in a timely manner. Due to its criticality for society, this fast-growing field attracts researchers from diverse backgrounds, resulting in huge momentum and diversity of contributions. Consequently, due to a lack of standard interfaces, standard tools for evaluating IDSs do not exist. Based on IPAL - a common message format that decouples IIDSs from domain-specific communication protocols, the ipa-evalute
tool serves to address this problem by combining a variety of commonly used performance metrics into a single solution for scientific evaluation.
This repository contains the ipal-evaluate
tool along with several useful scripts which facilitate easy and unified evaluations in industrial IDS research.
The evaluation framework contains implementations of the following metrics. Note that we distinguish between metric types that operate on point-based (in arbitrary order) and time-series (temporarily ordered) data. For a detailed description of the metrics, please refer to our publication "Evaluations in Industrial Intrusion Detection Research".
Metric (Synonyms) | Type | Formula | Description | Better Score | value range |
---|---|---|---|---|---|
True Negatives (TN) | Point-based | -- | Number of benign entries classified as benign. | high | |
True Positives (TP) | Point-based | -- | Number of attack entries detected by the IIDS. | high | |
False Negatives (FN) | Point-based | -- | Number of attack entries not detected by the IIDS. | low | |
False Positives (FP) | Point-based | -- | Number of benign entries alerted by the IIDS. | low | |
Accuracy (Rand Index) | Point-based | Accuracy captures the overall proportion of correct classifications. The higher the accuracy score is, the more reliable the predictions of the IIDS are. Synonyms: Rand Index. | high | ||
Precision (PPV, Confidence) | Point-based | Precision is the proportion of correct classifications among all positive classifications (entries classified as malicious). It captures the validness of positive classifications. Synonyms: PPV, Confidence. | high | ||
Inverse-Precision (NPV, TNA) | Point-based | In contrast to precision, inverse precision is the proportion of benign entries correctly classified as benign. Synonyms: NPV, TNA. | high | ||
Recall (TPR, Sensitivity, Hit Rate) | Point-based | Recall states how many malicious entries of the dataset are actually detected by an IIDS. It captures the completeness of positive classifications. Synonyms: TPR, Sensitivity, Hit Rate. | high | ||
Inverse-Recall (TNR, Specificity, Selectivity) | Point-based | Inverse recall is the proportion of classifications as benign behavior that are correct. Synonyms: TNR, Specificity, Selectivity. | high | ||
Per-Scenario Recall | Point-based | Recall measurement on a per-attack-scenario basis. | high | ||
Fallout (FPR) | Point-based | Fallout calculates the fraction of false alarms across the dataset. Synonyms: FPR. | low | ||
Missrate (FNR) | Point-based | Missrate measures the fraction of missed malicious entries. Synonyms: FNR. | low | ||
Informedness (Youden's J statistic) | Point-based | Informedness aggregates recall and inverse recall, measuring how informed the IIDS is, i.e. the completeness of both positive and negative classifications. Synonyms: Youden's J statistic. | high | ||
Markedness | Point-based | Markedness aggregates precision and inverse precision, measuring the reliability of the IIDS, i.e. the validness of both positive and negative classifications. | high | ||
F-Score | Point-based | Usually, an inherent tradeoff between achieving a maximal number of detected attacks (recall) while reducing false positives (precision) exists. The F-score combines both design goals into a single metric. F1 is the harmonic mean between precision and recall. | high | ||
Matthews correlation coefficient (MCC, Phi coefficient) | Point-based | Matthew's Correlation Coefficient measures the correlation between the IIDS' classification and the ground truth. Its main advantage over the F-score is that it is not affected by over-representation of either benign of malicious entries. Synonyms: Phi coefficient. | high | ||
Jaccard Index (Tanimoto Index) | Point-based | The Jaccard index measures the similarity between the set of entries deemed to be malicious obtained from the classification and the one obtained from the ground truth. Synonyms: Tanimoto Index. | high | ||
Jaccard Distance | Point-based | The Jaccard distance is the complement to the Jaccard index, it measures the dissimilarity between the classification and the ground truth. | low | ||
Detected Scenarios | Time-aware | -- | Detected scenarios lists the attack scenarios detected by at least a single alarm. | -- | -- |
Detected Scenarios Percent | Time-aware | Proportion of attack scenarios that were detected by the IIDS. | high | ||
Penalty Score (PS) | Time-aware |
|
Penalty Score (PS) is the length of detection results outside their overlap with attack scenarios (cf. TABOR paper). | low | |
Detection Delay | Time-aware |
|
The detection delay aggregates the time intervals between the start of an attack and the time of the first detection. | low | |
True Positive Alarms (TPA) | Time-aware | -- | True positive alarms (TPA) counts the number of continuous alarms that overlap with at least a single attack. | high | |
False Positive Alarms (FPA) | Time-aware | -- | False positive alarms (FPA) counts the number of continuous alarms that do not overlap with any attack. | low | |
enhanced Time-aware Precision (eTaP) | Time-aware |
|
Hwang et al. proposed their (enhanced) time series-aware variants for classical point-based metrics, i.e., precision, recall, and F1, addressing known issues when adopting point-based metrics for time series-aware evaluations. To replace precision, for instance, eTaP implements diminishing returns for long-lasting alarms. | high | |
enhanced Time-aware Recall (eTaR) | Time-aware |
|
Similarly, while point-based recall weights long attacks as more important, the new time series-aware recall variant (eTaR) treats all consecutive attacks equally. | high | |
enhanced Time-aware F Score (eTaf) | Time-aware | The proposed eTaF score is defined in the same way as the regular F score but leverages the substitute eTaP and eTaR metrics. | high | ||
BATADAL Time to Detection (TTD) | Time-aware |
|
BATADAL time-to-detection. Normalized time until an attack is detected. | high | |
BATADAL CLF (Classification Performance) | Time-aware | BATADAL classification performance. Mean between TNR and TPR | high | ||
BATADAL | Time-aware | $\gamma \cdot \text{BATADAL}{\text{TTD}} + (1 - \gamma) \cdot \text{BATADAL}{\text{CLF}}$ | Weighted BATADAL-TTD and BATADAL-CLF | high | |
NAB-score | Time-aware |
|
The NAB score weighs the evaluation of classification results based on their relative position to attack scenarios. Rewards (for true positives) and penalities (for false positives) are scaled by the sigmoid function centered around the end of attack windows. This ensures that early detections are rewarded, while trailing false positives are only gradually penalized. | high | |
Affiliation Precision/Recall | Time-aware | -- | The Affiliation Metric implements two variants of precision and recall to solve the insufficiencies of their point-based counterparts. Moreover, this metric claims to be more resilient against adversarial algorithms and random predictions. | high |
- Olav Lamberts, Konrad Wolsing, Eric Wagner, Jan Pennekamp, Jan Bauer, Klaus Wehrle, and Martin Henze. SoK: Evaluations in Industrial Intrusion Detection Research. Journal of Systems Research (jSys 2023). 2023 http://doi.org/10.5070/SR33162445
- Konrad Wolsing, Eric Wagner, Antoine Saillard, and Martin Henze. 2022. IPAL: Breaking up Silos of Protocol-dependent and Domain-specific Industrial Intrusion Detection Systems. In 25th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2022), October 26–28, 2022, Limassol, Cyprus. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3545948.3545968
- Wolsing, Konrad, Eric Wagner, and Martin Henze. "Poster: Facilitating Protocol-independent Industrial Intrusion Detection Systems." Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2020 https://doi.org/10.1145/3372297.3420019
If you are new to IPAL and want to learn about the general idea or try out our tutorials, please refer to IPAL's main repository: https://github.com/fkie-cad/ipal.
ipal-evaluate
requireslibgsl
(orlibgsl-dev
) to be installed. See https://www.gnu.org/software/gsl/doc/html/index.html for further information.
Use python3 -m pip install .
to install the scripts and dependencies system-wide using the pip
python package installer. This will install dependencies and the metrics
and evaluate
modules to the local site packages and add the ipal-evaluate
, ipal-plot-alerts
and ipal-plot-metrics
scripts to the PATH
. The scripts can then be invoked system-wide (e.g. ipal-evaluate -h
).
Alternatively, the project's dependencies can be installed locally in a virtual environment using the misc/install.sh
script or manually with:
python3 -m venv venv
source venv/bin/activate
python3 -m pip install numpy
python3 -m pip install -r requirements.txt
The scripts can then be invoked after activating the virtual environment from the root of the project repository, e.g.:
source venv/bin/activate
./ipal-evaluate -h
deactivate
Use docker build -t ipal-ids-metrics:latest .
to build a Docker image with a pip
installation of the project and development dependencies. The scripts can then be used within containers using the built image, e.g.:
docker run -it ipal-ids-metrics:latest /bin/bash
ipal-evaluate -h
The ipal-evaluate
tool is the main script of the project.
Given a list of IPAL messages containing IIDS classifications (the ids
field) and
ground truth (the malicious
field), it computes an evaluation of the
classifications under all the implemented metrics.
IPAL output files of IIDS implementations using the ipal-ids
framework, but also any
IPAL file where each message contains at least the ids
and malicious
fields can be
used with ipal-evaluate
.
Certain metrics may require the IPAL dataset to be timed, that is,
IPAL messages to contain a timestamp
field.
Others require the number, start and duration of attack scenarios to be
known, for which the path to an attack file must be provided via the
--attacks
command line argument.
Providing these additional pieces of information is not required, metrics
that cannot be evaluated due to missing data are simply skipped.
ipal-evaluate
's output is a report represented as a JSON object in the following format:
{
"<metric_1>": <val>,
"<metric_2>": <val>,
...
"<metric_n>": <val>,
"_evaluation-config": {}
}
The _evaluation-config
key-value pair contains the configuration under which
ipal-evaluate
was called, including command line arguments, version of the script,
and metric settings.
Metric settings are parameters affecting the computation of certain metrics such as
the gamma value in the BATADAL score, or the beta value in the F-score. Metric settings
can be set by editing their value in the evaluate.settings
module,
A complete description of command line arguments of the ipal-evaluate
script is given below:
usage: ipal-evaluate [-h] [--output FILE] [--attacks FILE] [--timed-dataset bool] [--log STR] [--logfile FILE] [--compresslevel INT] [--version] FILE
positional arguments:
FILE input file of IPAL messages to evaluate ('-' stdin, '*.gz' compressed) (Default: '-')
options:
-h, --help show this help message and exit
--output FILE output file to write the evaluation to ('-' stdout, '*.gz' compress) (Default: '-')
--attacks FILE JSON file containing the attacks from the used dataset ('*.gz' compress) (Default: None)
--timed-dataset bool is the dataset timed? Required by some metrics (True, False) (Default: True)
--log STR define logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) (Default: WARNING)
--logfile FILE file to log to (Default: stderr)
--compresslevel INT set the gzip compress level (0 no compress, 1 fast/large, ..., 9 slow/tiny) (Default: 9)
--version show program's version number and exit
Example usage:
./ipal-evaluate --log INFO misc/tests/input.state.gz --attacks misc/tests/attacks.json
This tool allows to visualize the occurrences of attacks contained in a dataset and the alerts generated by different IDSs.
IPAL files containing timestamped entries and classifications by the respective IDSs
should be passed as the positional IDS
argument and the attacks file with --attacks
.
Further options can be adjusted using other optional arguments, such as the --score
argument which allows to plot the score of specific classifiers used by IDSs.
usage: ipal-plot-alerts [-h] [--attacks attacks] [--mark-attacks list] [--draw-score score] [--draw-attack-id] [--mark-fp] [--draw-ticks] [--dataset dataset] [--min-width seconds]
[--title title] [--output output] [--log STR] [--logfile FILE] [--version]
IDS [IDS ...]
positional arguments:
IDS IDS classification files
options:
-h, --help show this help message and exit
--attacks attacks path to attacks.json file of the dataset
--mark-attacks list plot single attacks in different colour. Specify attacks by attack ids, separated by ','
--draw-score score plot the score of an IIDS, specify the IIDS name to visualize or use 'default' for a single IIDS
--draw-attack-id plot attack id on the attacks if provided by the attack file
--mark-fp plot false alerts in different colour
--draw-ticks plot ticks instead of interval ranges, e.g., for communication-based IIDSs
--dataset dataset name of the dataset to put on the plot (Default: '')
--min-width seconds Minimum width of attack bars in dataset entries (Default: 0)
--title title title to put on the plot (Default: '')
--output output file to save the plot to (Default: '': show in matplotlib window)
--log STR define logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) (Default: WARNING)
--logfile FILE file to log to (Default: stderr)
--version show program's version number and exit
Example usage:
./ipal-plot-alerts misc/tests/input.state.gz --attacks misc/tests/attacks.json --score MinMax
The ipal-plot-metrics
tool can be used to represent one or more reports
generated by ipal-evaluate
graphically as a radar chart.
The score of each evaluated classification is plotted along a
different axis for each considered metric.
This allows to visually compare the performance of multiple IIDSs under various
metrics.
The reports that should be plotted need to be passed as the results
argument, and the metrics specified with the optional --metrics
argument.
usage: ipal-plot-metrics [-h] [--metrics metrics] [--title title] [--output output]
[--log STR] [--logfile FILE] results [results ...]
positional arguments:
results list of files containing evaluation data of ipal-evaluate
options:
-h, --help show this help message and exit
--metrics metrics comma-separated list of metrics to be plotted
(Default: 'Accuracy,Precision,Recall,F1,F0.1,Detected-Scenarios-
Percent,eTaF0.1,eTaF1,eTaR,eTaP')
Possible metrics are: Confusion-Matrix,Accuracy,Precision,
Inverse-Precision,Recall,Inverse-Recall,Fallout,Missrate,
Informedness,Markedness,F-Score,MCC,Jaccard-Index,Jaccard-Distance,
Detected-Scenarios,Detected-Scenarios-Percent,Scenario-Recall,
Penalty-Score,Detection-Delay,TPA,FPA,TaPR,BATADAL-TTD,
BATADAL-CLF,BATADAL
--title title title to put on the plot (Default: '')
--output output file to save the plot to (Default: '': show in matplotlib window)
--log STR define logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
(Default: WARNING)
--logfile FILE file to log to (Default: stderr)
Example usage:
./ipal-plot-metrics misc/tests/example-output.json --metrics Accuracy,Precision,Recall,F1
TODO
usage: ipal-tune [-h] [--config FILE.py] [--restart-experiment] [--resume-errored] [--max-cpus INT] [--max-gpus INT] [--default.config] [--log STR] [--logfile FILE] [--compresslevel INT]
[--version]
options:
-h, --help show this help message and exit
--config FILE.py load IDS configuration and hyperparameters from the specified file (python file).
--restart-experiment Usually experiments are resumed. If this parameter is provided, a new experiment ist started from scratch.
--resume-errored Usually errored experiments are ignored. If this parameter is provided, errored experiments are rescheduled.
--max-cpus INT max number of CPUs to use (Default: detected available number of CPUs)
--max-gpus INT max number of GPUs to use (Default: detected available number of GPUs)
--default.config dump an exemplary default configuration
--log STR define logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) (Default: WARNING).
--logfile FILE file to log to (Default: stderr).
--compresslevel INT set the gzip compress level. 0 no compress, 1 fast/large, ..., 9 slow/tiny. (Default: 9)
--version show program's version number and exit```
#### Using the IPAL format
The `ipal-evaluate` and `ipal-plot-alerts` tools require data to
be provided in the IPAL format. Adequately-formatted data can be
generated by IDSs implemented in the IPAL-IDS framework, but suitable IPAL
data can also be generated manually from other data formats such as CSV.
For convenience a simple CSV-to-IPAl converter is provided under `misc/csv-to-ipal.py`.
The script requires an input in the form of a CSV file with at least two columns of data
containing the ground truth and the classification of the IDS respectively.
Which column contains each data type, and which other columns contain optional data
such as timestamps and classification by individual classifiers of the IDS must
be specified by the respective command line arguments.
Provided the source data contains timestamps, the script can further be used to
generate the optional attacks files by providing an output filename with
the `--attacks` argument.
The boundaries of attacks are deduced from observed sequences of entries considered
to be anomalous under the ground truth and as such might deviate from the actual attacks
defined in a dataset.
The resulting attack file is a JSON document containing an array of attack objects,
which have the required `id`, `start` and `end` attributes:
[ { "id": <attack_id>, "start": <start_timestamp>, "end": <end_timestamp>, "ipalid": }, ... ]
Complete usage information can be obtained through the `-h` command line argument:
`python3 csv-to-ipal.py -h`.
## Development
##### Tooling
The set of tools used for development, code formatting, style checking, and testing can be installed with the following command:
```bash
python3 -m pip install -r requirements-dev.txt
All tools can be executed manually with the following commands and report errors if encountered:
black .
flake8
python3 -m pytest
A black
and flake8
check of modified files before any commit can also be forced using Git's pre-commit hook functionality:
pre-commit install
More information on the black and flake8 setup can be found at https://ljvmiranda921.github.io/notebook/2018/06/21/precommits-using-black-and-flake8/
The process for adding support for a new metric is the following:
- Add a new file in the
metrics/
folder - Create a new class inheriting the Metric class (see
metrics/metric.py
).- The class should define the following attributes:
_name
:str
, name of the metric_description
:str
, brief description of the metric_requires
:List
ofstr
, list of the names of other metrics required for computation_requires_timed_dataset
:bool
, whether the metric requires the dataset to be timed_requires_attacks
:bool
, whether the metric requires an attack file to be provided_higher_is_better
:bool
, if a higher score is a better score
- The class should override the
calculate()
method: given a labeled dataset and the alerts of an IDS, this methods calculates the score of the IDS' predictions under that metric, and should return a dictionary in{metric_name: value}
format - The class may further override the
defines()
method, which should return the list of computed metrics (the keys in the dictionary retired bycalculate
). This is usually simply[cls._name]
, but can also include more than one sub-metric e.g. the F-Score metric actually defines F1, F0.5 etc.
- The class should define the following attributes:
- Add the new metric to the
metric.utils.metrics
list inmetric/utils.py
- Add a test for the newly added metric under
test/metrics/
- Add a description of the new metric to the implemented metrics table above
- Konrad Wolsing (Fraunhofer FKIE & RWTH Aachen University)
- Olav Lamberts (RWTH Aachen University)
- Frederik Basels (RWTH Aachen University)
- Patrick Wagner (RWTH Aachen University)
MIT License. See LICENSE for details.