Fusing Heterogeneous Data for Network Asset Classification – A Two-layer Approach

This repository contains the source code of the data fusion experiments presented in the paper:

O. Sedláček, V. Bartoš: Fusing Heterogeneous Data for Network Asset Classification – A Two-layer Approach. In: 2024 IEEE Network Operations and Management Symposium (NOMS 2024).

Paper abstract:

An essential aspect of cybersecurity management is maintaining knowledge of the assets in the protected network. Automated asset discovery and classification can be done using various methods, differing in reliability and the provided type of information. Therefore, deploying multiple methods and combining their results is usually needed - but this is a nontrivial task. In this paper, we describe our case of how we got to the need for such a data fusion method, how we approached it, and we present our solution - a two-layer data fusion method that can effectively fuse multiple heterogeneous and unreliable sources of information about a network device to classify it. The method is based on a combination of expert-written conditions, machine learning from small amounts of data and the Dempster-Shafer theory of evidence. We evaluate the method on the task of operating system recognition using data from real network traffic and several generated datasets simulating different conditions.

Repository structure:

  • config - contains configuration files for experiments and attribute specifications.
  • data - contains dataset and data files generated during experiments.
  • dependencies - contains files that allow working with DP3 components.
  • pyds - the PyDS library, implementing the Dempster-Shafer theory, taken from https://github.com/reineking/pyds.
  • ruleloader - implementation of the classification rule interpreter.
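
Since the fusion layer relies on the Dempster-Shafer theory of evidence, the following minimal sketch shows the standard Dempster's rule of combination in plain Python. It does not use the bundled pyds library and is not the implementation from the paper; the frame of discernment and the two source outputs are hypothetical values chosen only for illustration.

```python
from itertools import product


def combine_dempster(m1, m2):
    """Combine two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of hypotheses to its mass;
    the masses in each function are expected to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass falling on contradictory hypotheses is the conflict K.
            conflict += mass_a * mass_b
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Normalise by 1 - K so the combined masses sum to 1 again.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


# Hypothetical frame of discernment and source outputs (illustration only).
FRAME = frozenset({"linux", "windows", "android"})
source_a = {frozenset({"linux"}): 0.6, FRAME: 0.4}
source_b = {frozenset({"linux", "android"}): 0.7, FRAME: 0.3}

fused = combine_dempster(source_a, source_b)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 3))
```

Mass assigned to the whole frame expresses ignorance, which is how an unreliable source can be given less influence on the fused result.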

Dependencies

This repository has been tested with Python 3.9. The packages required by the solution can be installed using the requirements.txt file in the root directory:

$ pip install -r requirements.txt

Report generation requires a LaTeX distribution; the Makefile uses xelatex.

Dataset Manager

The repository allows switching between multiple datasets, which are stored in data/stash. This folder is automatically created and managed by the utility script dataset_manager.py. Adding a dataset is possible using:

$ python dataset_manager.py add <dataset_name> <dataset_description>

This creates a directory structure for the dataset, which can be used afterwards.

To list all datasets and the current dataset, use:

$ python dataset_manager.py status

To switch between datasets, use:

$ python dataset_manager.py switch <dataset_name>

When switching datasets, the complete state of generated experiment results is preserved, except for pdf visualizations.

The repository contains the real network dataset used in the paper, titled paper. You might have to add a new empty dataset and then switch back to paper before the first run to initialize the symlinks properly.

Data path

This section briefly describes the contents of the data folder. It should contain (symlink) subfolders generated, rules, source and summaries.
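
To check whether the data folder is initialized as described, a small helper along these lines may be useful. It is a sketch that is not part of the repository; the function name check_data_layout is hypothetical, and the only facts it relies on are the four subfolder names listed above.

```python
from pathlib import Path

EXPECTED_SUBFOLDERS = ("generated", "rules", "source", "summaries")


def check_data_layout(data_dir="data"):
    """Map each expected subfolder to 'symlink', 'directory' or 'missing'."""
    data_path = Path(data_dir)
    status = {}
    for name in EXPECTED_SUBFOLDERS:
        entry = data_path / name
        if entry.is_symlink():
            # Note: a dangling symlink is also reported as 'symlink'.
            status[name] = "symlink"
        elif entry.is_dir():
            status[name] = "directory"
        else:
            status[name] = "missing"
    return status


if __name__ == "__main__":
    for name, state in check_data_layout().items():
        print(f"data/{name}: {state}")
```

If any entry is reported as missing, adding an empty dataset and switching back with dataset_manager.py, as described in the Dataset Manager section, is the first thing to try.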

The source folder contains the dataset file - data/source/dataset.csv.

To include Shodan annotations, the data/source folder should contain the file shodan_os_extracted.csv, which contains Shodan annotations in the format <IP>,<OS>.
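
As an illustration of the <IP>,<OS> format, the sketch below parses such rows into a dictionary. It assumes the file has no header row and one annotation per line; the function name load_shodan_annotations and the sample addresses are hypothetical and not taken from the repository.

```python
import csv
import io


def load_shodan_annotations(lines):
    """Parse rows in the <IP>,<OS> format into a dict {ip: os_name}.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumption: no header row; only the first two columns are used.
    """
    annotations = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ip, os_name = row[0].strip(), row[1].strip()
        annotations[ip] = os_name
    return annotations


# Sample data using documentation-only IP addresses.
sample = io.StringIO("192.0.2.10,Linux\n192.0.2.20,Windows\n")
print(load_shodan_annotations(sample))
```

For the real file, the same function can be called on a file object opened with open("data/source/shodan_os_extracted.csv", newline="").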

The generated folder contains all files generated by the experiments (Makefile targets experiments and experiments-long). The simplest way to view the results is to run the report generation target make report and view the generated pdf file latex/report.pdf. The data path between experiments is taken from the config/experiments.yml file, which contains a detailed description of the data flow.

Running experiments

To initialize the repository for running experiments, use the Makefile target init:

$ make init

To run all experiments (once a dataset is present), use the provided Makefile. It is recommended to run make with the -j parameter so that experiments execute in parallel where possible.

$ make -j 8

This runs the report target, which covers measurements of individual modules, data fusion at the module level, and training of rules with data fusion at that level, including the final experiment with optimization of trust assignment by modularity. The experiment results are then available as a report in the latex/report.pdf file. If you do not want to install a LaTeX distribution, you can still view the generated plots in the latex/generated folder.

The make clean target removes all generated files, except for the dataset.

Generating Datasets

Before working with synthetic datasets, set the configuration appropriately using make init-synth. Before generating, it is also advisable to create and switch to a new dataset using python dataset_manager.py add <dataset_name>, otherwise the current dataset will be overwritten.

The synthesis_param_editor.ipynb notebook allows editing of synthetic dataset parameters. In the current configuration, the generated datasets are stored in config/synthesis/. To generate a dataset from the parameters, place the parameters in the config/synthesis/current/ folder and run make synthetic_dataset_current. The resulting dataset is stored in place of the current dataset. You can then work with it like any other dataset and generate the experiment results and the report, e.g. using make report -j 8.

Hourglass

The hourglass.py script automates the process of obtaining experiment results from configurations. It monitors two paths, which are currently hardcoded: config/hourglass_in and config/hourglass_out. Dataset parameters produced by synthesis_param_editor.ipynb are placed in config/hourglass_in to be processed. Once started, the script processes all inserted configurations and moves each processed one to config/hourglass_out, along with a log of the standard and error output and the generated experiment report. The script runs indefinitely; when no configuration is available, it sleeps in a 30-second loop until one appears. Run it using python hourglass.py; for synthetic datasets, add the --full-synth option.
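
The polling pattern described above can be sketched as follows. This is not the code of hourglass.py: the function names process_pending and watch and the handle callback are hypothetical, and the callback stands in for running the experiments on one configuration.

```python
import shutil
import time
from pathlib import Path


def process_pending(in_dir, out_dir, handle):
    """Process every entry currently in in_dir, then move it to out_dir.

    `handle` is a caller-supplied callable standing in for the
    experiment run. Returns the names of the processed entries.
    """
    in_path, out_path = Path(in_dir), Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    processed = []
    for entry in sorted(in_path.iterdir()):
        handle(entry)
        shutil.move(str(entry), str(out_path / entry.name))
        processed.append(entry.name)
    return processed


def watch(in_dir, out_dir, handle, sleep_seconds=30):
    """Poll in_dir forever, sleeping when there is nothing to process."""
    while True:
        if not process_pending(in_dir, out_dir, handle):
            time.sleep(sleep_seconds)
```

Separating the single pass (process_pending) from the endless loop (watch) keeps the pass easy to test on temporary folders.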


Contact: Ondřej Sedláček ondrej.sedlacek@cesnet.cz

Copyright (C) 2022-2024 CESNET, z.s.p.o.

License: CC-BY 4.0