
Official Repository of "Robust Malware Classification via Deep Graph Networks on Call Graph Topologies" (ESANN 2021)

Primary language: Jupyter Notebook. License: GNU General Public License v3.0 (GPL-3.0).

Robust Malware Classification via Deep Graph Networks on Call Graph Topologies

Description

This repository lets you reproduce the experiments of our ESANN 2021 paper:

Errica Federico, Iadarola Giacomo, Martinelli Fabio, Mercaldo Francesco, Micheli Alessio: Robust Malware Classification via Deep Graph Networks on Call Graph Topologies, European Symposium on Artificial Neural Networks (ESANN), 2021.

Requirements

  • Use this link to fetch the compressed dataset to be processed
  • PyDGN (we used PyDGN 0.5.0)

Building the dataset

Once you have unzipped the dataset file DATA_NOFEATS.zip, run the following:

  1. Original dataset

    python build_dataset.py --config-file DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS.yml

  2. Obfuscated test set

    python build_dataset.py --config-file DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS_OBF_TEST.yml
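The two build steps above can also be chained from Python, so that the second step only runs if the first one succeeds. This is a minimal sketch, assuming it is run from the repository root after unzipping DATA_NOFEATS.zip; the helper names `build_commands` and `run_all` are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Config files taken from the two build steps above.
CONFIGS = [
    "DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS.yml",
    "DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS_OBF_TEST.yml",
]


def build_commands(configs=CONFIGS, python=sys.executable):
    """Return one build_dataset.py invocation per config file (hypothetical helper)."""
    return [[python, "build_dataset.py", "--config-file", cfg] for cfg in configs]


def run_all(commands, runner=subprocess.run):
    """Run the commands in order; check=True stops at the first failing step."""
    for cmd in commands:
        runner(cmd, check=True)


# Usage, from the repository root:
#     run_all(build_commands())
```

Passing `check=True` makes `subprocess.run` raise `CalledProcessError` on a non-zero exit code, so a failed build of the original dataset is not silently followed by the obfuscated one.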

Launching the experiments

The PyDGN config files are set to use a GPU (see device: cuda and similar settings). To run on CPUs only, pass --max-gpus 0 and change the config files accordingly (see here for how). You can also remove --debug to enable the CLI and exploit CPU/GPU task parallelism; adjust the parallelism parameters as you see fit.
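As an illustration of the config change for CPU-only runs, the sketch below rewrites the device entry in the text of a config file. It assumes the setting appears literally as a `device: cuda` line, as mentioned above; any other GPU-related settings in the config files are not handled here and should be adjusted by hand following the PyDGN documentation. The function name `use_cpu` is hypothetical.

```python
import re


def use_cpu(config_text):
    """Replace `device: cuda` (or `device: cuda:0`) with `device: cpu`.

    Only lines whose key is exactly `device` are touched; this is an
    assumption about the config layout, not a PyDGN API.
    """
    pattern = r"^([ \t]*device:[ \t]*)cuda\S*"
    return re.sub(pattern, r"\1cpu", config_text, flags=re.MULTILINE)


example = "model: some_model\ndevice: cuda\nbatch_size: 32\n"
print(use_cpu(example))
```

To apply it to a real file, read the file contents, pass them through `use_cpu`, and write the result to a copy of the config so the original GPU version is preserved.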

  1. Baseline

    python launch_experiment.py --config-file MODEL_CONFIGS/config_baseline.yml --splits-folder DATA_SPLITS/ --data-splits DATA_SPLITS/CG/CG_outer1_inner1.splits --data-root DATA_NOFEATS --dataset-name CG --dataset-class cnr_dataset.CNRMalwareDataset --max-cpus 4 --max-gpus 1 --final-training-runs 3 --result-folder CIML_CNR_RESULTS --debug

  2. CGMM

    Pre-condition: modify the config files to set a folder in which to store the intermediate graph embeddings that will be used by the classifier.

    Note: the result folder may end up taking a lot of space, because intermediate outputs between layers are written there. These files are deleted when each experiment ends, but disk usage can still become a problem when running experiments in parallel. Please consider placing your result folder on secondary storage.

    Unsupervised Embedding Phase

    python launch_experiment.py --config-file MODEL_CONFIGS/config_CGMM_Embedding.yml --splits-folder DATA_SPLITS/ --data-splits DATA_SPLITS/CG/CG_outer1_inner1.splits --data-root DATA_NOFEATS --dataset-name CG --dataset-class cnr_dataset.CNRMalwareDataset --max-cpus 4 --max-gpus 1 --final-training-runs 3 --result-folder CIML_CNR_RESULTS --debug

    Supervised Classifier Phase

    python launch_experiment.py --config-file MODEL_CONFIGS/config_CGMM_Classifier.yml --splits-folder DATA_SPLITS/ --data-splits DATA_SPLITS/CG/CG_outer1_inner1.splits --data-root DATA_NOFEATS --dataset-name CG --dataset-class cnr_dataset.CNRMalwareDataset --max-cpus 4 --max-gpus 1 --final-training-runs 3 --result-folder CIML_CNR_RESULTS --debug
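The three launch commands above differ only in the value of --config-file. The sketch below builds them from one shared set of flags, which makes it easier to change, for example, --max-gpus in a single place. The flags and values are copied from the commands above; the helper name `launch_command` is hypothetical and not part of the repository.

```python
import shlex

# Flags shared by the three launch commands above.
COMMON_ARGS = {
    "--splits-folder": "DATA_SPLITS/",
    "--data-splits": "DATA_SPLITS/CG/CG_outer1_inner1.splits",
    "--data-root": "DATA_NOFEATS",
    "--dataset-name": "CG",
    "--dataset-class": "cnr_dataset.CNRMalwareDataset",
    "--max-cpus": "4",
    "--max-gpus": "1",
    "--final-training-runs": "3",
    "--result-folder": "CIML_CNR_RESULTS",
}


def launch_command(config_file, debug=True, **overrides):
    """Build the argument list for launch_experiment.py (hypothetical helper).

    Keyword overrides map to flags, e.g. max_gpus=0 becomes `--max-gpus 0`.
    """
    args = dict(COMMON_ARGS)
    for key, value in overrides.items():
        args["--" + key.replace("_", "-")] = str(value)
    cmd = ["python", "launch_experiment.py", "--config-file", config_file]
    for flag, value in args.items():
        cmd += [flag, value]
    if debug:
        cmd.append("--debug")
    return cmd


# Print the baseline command so it can be pasted into a shell.
print(shlex.join(launch_command("MODEL_CONFIGS/config_baseline.yml")))
```

For a CPU-only run without debug mode, `launch_command("MODEL_CONFIGS/config_baseline.yml", debug=False, max_gpus=0)` produces the same command with `--max-gpus 0` and no `--debug`; the config files still need the matching device change.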

Inference on obfuscated dataset

Once the experiments have completed, use the notebook CGMM Inference to perform inference (remember to change the experiment paths accordingly) and the notebook Confusion Matrix to output the confusion matrix.
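For readers unfamiliar with the output of the second notebook, the sketch below shows how a confusion matrix is computed from true and predicted class indices. It is a generic illustration in plain Python, not the code used in the Confusion Matrix notebook, and the labels in the example are made up.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Return a matrix where entry [t][p] counts samples of true class t
    that were predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Illustrative labels only (three classes, six samples).
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 0, 2]
for row in confusion_matrix(y_true, y_pred, num_classes=3):
    print(row)
```

Correct predictions lie on the diagonal; off-diagonal entries show which classes are confused with each other, which is the quantity of interest when comparing the original and obfuscated test sets.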

Troubleshooting

If you have questions, do not hesitate to contact us! If you find a bug, please open an issue on GitHub.