/diaphora-ml

Diaphora Machine Learning tools and datasets

Primary LanguagePythonGNU Affero General Public License v3.0AGPL-3.0

Diaphora ML

This repository contains the Diaphora Machine Learning tools and the datasets used by Diaphora to generate the distributed ML model.

What is in this repository?

The directory tools contains the following tools:

  • diaphora-export.sh: A bash shell script to export with Diaphora a binary file. Please edit it before to setup your IDA and Diaphora paths.
  • create_dataset.py: Python tool to build a dataset cross comparing binaries that the tool believe are the same. Currently, only the following 2 rules are considered: 1) 99% of the functions have the same address, or 2) 95% of the function names are the same (ignoring auto-generated ones).
  • train_dataset.py: Python tool to train a dataset and build model or check an already trained model against a different dataset. It supports a myriad of other use-cases, please see its help for more details.
  • validate_against_diff.py: Python tool to check a ML model against a saved Diaphora diffing results database. It's mostly used to identify potential false positives with a trained ML model.

The directory datasets contains the following files:

  • cisco-talos-dataset-2.csv.bz2: A BZ2 compressed CSV file with the generated dataset cross comparing similar binaries from the Cisco Talos Dataset-2.
  • diaphora-amalgamation.csv.bz2: A BZ2 compressed CSV file with the Diaphora's dataset called "the amalgamation", containing the cross comparison data for the Diaphora's testing suite, issues files, and the Cisco Talos Dataset-2 files.

The directory example contains the following example dataset and script:

  • mini-dataset: A directory with a few binary executables. Just an example dataset.
  • run.sh: Example script that exports with Diaphora every single binary in the mini-dataset directoy, generates a CSV files cross comparing all the exported binaries, splits the created dataset into the usual training, validation and test datasets; and then trains a model (using a decision tree algorithm) and verifies it against the test and validation datasets. Please remember to change in the file /tools/diaphora-export.h the paths for IDA and Diaphora.

The directory docs contains a presentation given in the conference 44CON describing Machine Learning as implemented in Diaphora as of September 2024.