This repository contains the source code of the data fusion experiments presented in the paper:
O. Sedláček, V. Bartoš: Fusing Heterogeneous Data for Network Asset Classification – A Two-layer Approach. In: 2024 IEEE Network Operations and Management Symposium (NOMS 2024).
Paper abstract:
An essential aspect of cybersecurity management is maintaining knowledge of the assets in the protected network. Automated asset discovery and classification can be done using various methods, differing in reliability and the provided type of information. Therefore, deploying multiple methods and combining their results is usually needed - but this is a nontrivial task. In this paper, we describe our case of how we got to the need for such a data fusion method, how we approached it, and we present our solution - a two-layer data fusion method that can effectively fuse multiple heterogeneous and unreliable sources of information about a network device to classify it. The method is based on a combination of expert-written conditions, machine learning from small amounts of data and the Dempster-Shafer theory of evidence. We evaluate the method on the task of operating system recognition using data from real network traffic and several generated datasets simulating different conditions.
Contents:
- Repository structure
- Dependencies
- Dataset Manager - utility script for dataset management
- Data path - description of the flow of data in experiments
- Running experiments and report generation
- Generating Datasets - how to generate synthetic datasets
- Hourglass - a supervisor script for generating multiple reports in a row
- config - contains configuration files for experiments and attribute specifications.
- data - contains dataset and data files generated during experiments.
- dependencies - contains files that allow working with DP3 components.
- pyds - the PyDS library, implementing the Dempster-Shafer theory, taken from https://github.com/reineking/pyds.
- ruleloader - implementation of the classification rule interpreter.
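For readers unfamiliar with the Dempster-Shafer theory, the following is a minimal plain-Python sketch of Dempster's rule of combination. It does not use the PyDS API and is not the fusion code of this repository; the function name combine, the two sources and their mass values are made up for illustration only.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of both sets.
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Normalize by the non-conflicting mass.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


# Hypothetical example: two sources reporting the OS of one device.
frame = frozenset({"linux", "windows"})
source_a = {frozenset({"linux"}): 0.6, frame: 0.4}
source_b = {frozenset({"windows"}): 0.3, frame: 0.7}
fused = combine(source_a, source_b)
```

Mass assigned to the whole frame expresses a source's uncertainty, which is how an unreliable source can be weighted down before fusion.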
This repository has been tested with Python 3.9.
The required packages can be installed from the requirements.txt file in the root directory:
$ pip install -r requirements.txt
Report generation requires a LaTeX distribution; the Makefile uses xelatex.
The repository allows switching between multiple datasets, which are stored in data/stash.
This folder is automatically created and managed by the utility script dataset_manager.py.
Adding a dataset is possible using:
$ python dataset_manager.py add <dataset_name> <dataset_description>
This creates a directory structure for the dataset, which can be used afterwards.
To list all datasets and the current dataset, use:
$ python dataset_manager.py status
To switch between datasets, use:
$ python dataset_manager.py switch <dataset_name>
When switching datasets, the complete state of generated experiment results is preserved, except for PDF visualizations.
The repository contains the real network dataset used in the paper, titled paper.
You might have to add a new empty dataset and then switch back to paper before the first run to initialize the symlinks properly.
This section briefly describes the contents of the data folder.
It should contain the (symlinked) subfolders generated, rules, source and summaries.
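A quick way to confirm this layout is a small check such as the sketch below. The helper missing_subfolders is hypothetical and not part of the repository; only the four subfolder names come from the description above.

```python
from pathlib import Path

# Subfolder names taken from the description of the data folder.
EXPECTED_SUBFOLDERS = ("generated", "rules", "source", "summaries")


def missing_subfolders(data_dir="data"):
    """Return the expected subfolders that are absent from data_dir."""
    root = Path(data_dir)
    # is_dir() follows symlinks, so a symlink to a directory counts as present
    # and a dangling symlink counts as missing.
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

An empty result means all four subfolders resolve to directories; a non-empty result may indicate that the symlinks have not been initialized yet.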
The source folder contains the dataset file, data/source/dataset.csv.
To include Shodan annotations, the data/source folder should contain the file shodan_os_extracted.csv,
which holds the annotations in the format <IP>,<OS>.
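As an illustration of the <IP>,<OS> format, a file of this shape could be read as sketched below. The function load_shodan_annotations is hypothetical and not the repository's loader; it assumes the file has no header row and that a later row for the same IP overrides an earlier one.

```python
import csv


def load_shodan_annotations(path="data/source/shodan_os_extracted.csv"):
    """Read <IP>,<OS> rows into a dict mapping IP address to OS label."""
    annotations = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            ip, os_label = row[0].strip(), row[1].strip()
            # Assumption: one label per IP, the last row wins.
            annotations[ip] = os_label
    return annotations
```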
The generated folder contains all files generated by the experiments (Makefile targets experiments and experiments-long).
The simplest way to view the results is to run the report generation target make report
and view the generated PDF file latex/report.pdf.
The data path between experiments is taken from the config/experiments.yml
file, which contains a detailed description of the data flow.
To initialize the repository for running experiments, use the Makefile target init:
$ make init
To run all experiments (once a dataset is present), use the Makefile.
Running make with the -j parameter is recommended, as it executes experiments in parallel where possible.
$ make -j 8
This runs the report target, which covers measurements of individual modules,
data fusion at the module level, and training of rules with data fusion at this level,
including the final experiment that optimizes trust assignment by modularity.
The experiment results are then available as a report in the latex/report.pdf file.
If you do not want to install a LaTeX distribution, you can still view the generated plots in the latex/generated folder.
The make clean target removes all generated files except the dataset.
Before working with synthetic datasets, set up the configuration using make init-synth.
Before generating, it is also advisable to switch to a new dataset using python dataset_manager.py add <dataset_name>,
otherwise the current dataset will be overwritten.
The synthesis_param_editor.ipynb notebook allows editing of synthetic dataset parameters.
In the current configuration, the generated datasets are stored in config/synthesis/.
To generate a dataset from the parameters, place them in the config/synthesis/current/ folder
and run make synthetic_dataset_current. The resulting dataset is stored in place of the current dataset.
You can then work with the dataset like any other and generate the experiment results and report, e.g. using make report -j 8.
The hourglass.py script automates this process of obtaining experiment results from configurations.
Currently, two paths are hardcoded, config/hourglass_in and config/hourglass_out, which the script monitors.
The dataset parameters from synthesis_param_editor.ipynb that are to be processed are placed in config/hourglass_in.
After starting, the script processes all inserted configurations and moves the processed ones to config/hourglass_out,
along with the log of the standard and error output and the generated report of the experiments.
The script runs indefinitely; when no configuration is available, it sleeps in a 30-second loop.
The script can be run using python hourglass.py; for synthetic datasets, it is necessary to add --full-synth.
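The monitoring behaviour described above can be sketched roughly as follows. This is an illustration of the polling pattern only, not the code of hourglass.py: the names process_pending and watch and the handler callback are hypothetical, and the sketch omits the logging and report copying that the real script performs.

```python
import shutil
import time
from pathlib import Path


def process_pending(in_dir, out_dir, handler):
    """Run handler on every entry in in_dir, then move the entry to out_dir.

    Returns the names of the processed entries.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for entry in sorted(in_dir.iterdir()):
        handler(entry)  # hypothetical: generate the dataset and build the report
        shutil.move(str(entry), str(out_dir / entry.name))
        processed.append(entry.name)
    return processed


def watch(in_dir, out_dir, handler, sleep_seconds=30):
    """Poll in_dir forever, sleeping whenever nothing was waiting."""
    while True:
        if not process_pending(in_dir, out_dir, handler):
            time.sleep(sleep_seconds)
```

Moving each configuration only after its handler returns means that a configuration interrupted mid-run stays in the input folder and is picked up again on the next pass.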
Contact: Ondřej Sedláček ondrej.sedlacek@cesnet.cz
Copyright (C) 2022-2024 CESNET, z.s.p.o.
License: CC-BY 4.0