The software communicates with the Neo4j database through the API provided by Py2neo, so a Neo4j database must be available for the software to work correctly. The Neo4j database URL must therefore be specified in the `config/config.ini` configuration file. If the database is still empty, there are two ways to initialize it:
- Import a dump file.
- Build it from ELF binaries and their VirusTotal reports.
To initialize from an existing database, we recommend our CrySyS-IoT-MMDB-2024 dataset. To import a database dump into Neo4j, issue the following command:
neo4j-admin database load --from-path=/var/lib/neo4j/import --overwrite-destination=true --verbose neo4j
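When the import is scripted, the same command can be assembled as an argument list. The sketch below is illustrative only: the helper names `build_load_command` and `load_dump` are hypothetical, while the flags are exactly those of the command above. Note that Neo4j 5 looks for a file named after the target database (here `neo4j.dump`) inside the `--from-path` directory.

```python
import subprocess


def build_load_command(dump_dir="/var/lib/neo4j/import", database="neo4j"):
    """Return the neo4j-admin argument list for loading a dump (hypothetical helper)."""
    return [
        "neo4j-admin", "database", "load",
        f"--from-path={dump_dir}",
        "--overwrite-destination=true",
        "--verbose",
        database,
    ]


def load_dump(dump_dir="/var/lib/neo4j/import", database="neo4j"):
    """Run the import; raises CalledProcessError if neo4j-admin exits with an error."""
    subprocess.run(build_load_command(dump_dir, database), check=True)
```

Calling `load_dump()` with no arguments runs the command shown above; passing a list instead of a shell string avoids quoting problems with unusual paths.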
For the second method, set the `[Initialize_graph]` block in the `config/config.ini` configuration file, entering the absolute paths of the root directories of the ELF binaries and of the VirusTotal reports.
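As a rough illustration of how such a block could be read and sanity-checked, here is a minimal sketch using Python's standard `configparser`. It assumes that the options `initalize_graph`, `dataset_src_dir_path` and `VT_report_src_dir_path` live in the `[Initialize_graph]` block; the function name `read_initialize_graph` and the returned dictionary keys are hypothetical and not part of the software.

```python
import configparser
from pathlib import Path


def read_initialize_graph(config_path="config/config.ini"):
    """Read the [Initialize_graph] block and check that both paths are absolute."""
    parser = configparser.ConfigParser()
    if not parser.read(config_path):
        raise FileNotFoundError(config_path)

    # Assumption: the three options below belong to this block.
    block = parser["Initialize_graph"]
    settings = {
        "initialize": block.getboolean("initalize_graph"),
        "dataset_dir": Path(block["dataset_src_dir_path"]),
        "vt_report_dir": Path(block["VT_report_src_dir_path"]),
    }

    # The README asks for absolute paths, so reject relative ones early.
    for key in ("dataset_dir", "vt_report_dir"):
        if not settings[key].is_absolute():
            raise ValueError(f"{key} must be an absolute path: {settings[key]}")
    return settings
```

Checking the paths before the run starts surfaces a misconfigured block immediately rather than partway through processing.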
Because the graph database is built on similarity, the dump files store the metadata of the processed ELF binaries separately for each architecture. The software currently supports the ARM and MIPS architectures.
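Similarity between two binaries is measured with TLSH and compared against the configured `threshold`. The sketch below shows one way such a check could look. It assumes that a pair counts as similar when the TLSH distance is at most the threshold (lower TLSH distances mean more similar inputs); the exact comparison used by the software is not stated here. The helper names are hypothetical, while `tlsh.hash` and `tlsh.diff` come from the python-tlsh package.

```python
def tlsh_digest(path):
    """Compute the TLSH digest of a file (requires the python-tlsh package)."""
    import tlsh  # imported lazily so the module loads without python-tlsh

    with open(path, "rb") as handle:
        return tlsh.hash(handle.read())


def is_similar(hash_a, hash_b, threshold, diff=None):
    """Return True when the TLSH distance between two digests is <= threshold.

    `diff` can be injected for testing; by default tlsh.diff is used.
    Assumption: "at most the threshold" is the similarity criterion.
    """
    if diff is None:
        import tlsh

        diff = tlsh.diff
    return diff(hash_a, hash_b) <= threshold
```

With python-tlsh installed, `is_similar(tlsh_digest(a), tlsh_digest(b), threshold)` compares two files on disk.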
Run the `main.py` file from the `src` folder with Python 3.10 or later.
- `initalize_graph`: `true` if the graph has not yet been initialized; `false` if a Neo4j graph already exists.
- `dataset_src_dir_path`: the absolute path of the root directory of the ELF binaries.
- `VT_report_src_dir_path`: the absolute path of the root directory of the corresponding VirusTotal reports.
- `initialize_dataset`: `false` if the initialization folders do not exist, otherwise `true`.
- `dataset_src_root_dir_path`: the path to the root directory of the dataset.
- `dataset_input_src_dir_path`: the software creates a folder structure for processing the ELF binaries in the dataset and automatically writes its path into this option of the configuration file.
- `arch`: the architecture of the ELF binaries that will be used to build the database.
- `threshold`: the software uses TLSH to compute the similarity between two nodes in the graph; the threshold determines whether two nodes are considered similar.
- `neo4j_uri`: the URI of the Neo4j database to connect to (default: `bolt://localhost:7687`).
- `neo4j_user`: the name of the user account used to access the database (default: `neo4j`).
- `neo4j_password`: the password of the user account used to access the database (default: `neo4j`).
- `accept`: should be `application/json`.
- `apikey`: the VirusTotal API key.
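
To check the Neo4j settings before a run, a connection can be opened with Py2neo and the node count queried. This is a minimal sketch, not the software's own code: the function names `connect` and `is_empty` are hypothetical, the defaults mirror the documented ones, and `Graph(...)` and `Graph.evaluate(...)` are the Py2neo calls used.

```python
def connect(neo4j_uri="bolt://localhost:7687", neo4j_user="neo4j",
            neo4j_password="neo4j"):
    """Open a Py2neo connection using the values from the configuration file."""
    from py2neo import Graph  # imported lazily; requires the py2neo package

    return Graph(neo4j_uri, auth=(neo4j_user, neo4j_password))


def is_empty(graph):
    """Return True if the database holds no nodes, i.e. it still needs initializing."""
    return graph.evaluate("MATCH (n) RETURN count(n)") == 0
```

If `is_empty(connect())` returns `True`, the database has not been initialized yet and one of the two initialization methods applies.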
The output folder contains the following subfolders, which store the information generated while processing the binaries:
- `local_reports`: for each ELF binary, a JSON file containing the metadata defined in the `graph-based-malware-db-json.schema` file.
- `temp_graphs`: temporary graph files.
- `VT_analyses`: the results of the VirusTotal analyses for each ELF binary.
- `VT_reports`: the VirusTotal reports for each ELF binary.
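
A quick way to inspect what a run produced is to count the files in each of these subfolders. The sketch below assumes only the four subfolder names listed above; the function name `summarize_output` and the choice to report a missing folder as `None` are illustrative.

```python
from pathlib import Path

# Subfolder names as listed in this README.
SUBFOLDERS = ("local_reports", "temp_graphs", "VT_analyses", "VT_reports")


def summarize_output(output_dir):
    """Map each expected subfolder to its file count, or None if it is missing."""
    root = Path(output_dir)
    summary = {}
    for name in SUBFOLDERS:
        folder = root / name
        if folder.is_dir():
            summary[name] = sum(1 for entry in folder.iterdir() if entry.is_file())
        else:
            summary[name] = None
    return summary
```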
To use the software correctly, it is essential to have a valid VirusTotal API v3 key.
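For orientation, the `accept` and `apikey` options correspond to HTTP headers of the VirusTotal API v3, which expects the key in the `x-apikey` header. The sketch below builds, without sending, a request for a file report. The helper names are hypothetical, and whether the software queries this particular endpoint is an assumption.

```python
import urllib.request

# VirusTotal API v3 file-report endpoint; the file hash is appended to it.
VT_FILES_URL = "https://www.virustotal.com/api/v3/files/"


def build_headers(apikey, accept="application/json"):
    """Return the HTTP headers carrying the configured accept type and API key."""
    return {"accept": accept, "x-apikey": apikey}


def build_report_request(file_hash, apikey):
    """Prepare (but do not send) a GET request for the report of one file."""
    return urllib.request.Request(
        VT_FILES_URL + file_hash, headers=build_headers(apikey)
    )
```

Passing the returned request to `urllib.request.urlopen` performs the query; an invalid key is rejected by VirusTotal with an HTTP error.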
Name | Version | Available at |
---|---|---|
Neo4j | 5.18.1 | https://neo4j.com/ |
PyExifTool | 0.5.6 | https://pypi.org/project/PyExifTool/ |
bintropy | | https://github.com/packing-box/bintropy |
python-tlsh | 4.5.0 | https://pypi.org/project/python-tlsh/ |
jsonschema | 4.22.0 | https://github.com/python-jsonschema/jsonschema |
AVClass | v2 | https://github.com/malicialab/avclass |
NetworkX | 3.3 | https://networkx.org/ |
Py2neo | 2021.1 | https://neo4j-contrib.github.io/py2neo/ |
The research presented in this paper was supported by the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory and by the European Union’s Horizon Europe Research and Innovation Program through the DOSS Project (Grant Number 101120270). The presented work also builds on results of the SETIT Project (2018-1.2.1-NKP-2018-00004), which was implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the 2018-1.2.1-NKP funding scheme. The authors are also thankful to VirusTotal for the academic license provided for research purposes.