Tami

Tool for Analyzing Malware represented as Images

TAMI (Tool for Analyzing Malware represented as Images) gathers the code, tools, and approaches presented in several publications by Giacomo Iadarola, a PhD student at IIT-CNR and the University of Pisa.

If you are using this repository, please consider citing our works (see the references at the end of this README file).


Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. We highly suggest running Tami in a Docker container, especially for running experiments on the GPU. Otherwise, you can run Tami in a virtualenv (see the Run in a Virtualenv section).

Run in a Docker container

SUGGESTED installation, almost mandatory for experimenting on the GPU

You can run TAMI in a container built upon the tensorflow/tensorflow:2.7.0-gpu image. This is strongly suggested for handling dependencies related to GPU drivers, because you only need to install Docker and the NVIDIA Docker support to work with TensorFlow's GPU support (see also the TensorFlow Docker Requirements for further instructions).

You can either (1, suggested) download the latest image from our cloud or (2) build the Tami image locally.

Download latest Tami from the cloud

In the docker/ folder of this repository, there is a script, download_and_load_image.sh, which downloads the latest Tami image from the cloud and loads it into Docker. Once loaded, you can run it with run_container.sh.

Scripts Usage:

download_and_load_image.sh

run_container.sh [--no-gpu] [--quantum]

Default execution:

docker/download_and_load_image.sh
docker/run_container.sh
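
Once inside the container, you can verify that TensorFlow sees the GPU with a quick check (a minimal sketch using the standard TensorFlow API, not one of the Tami scripts):

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"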

Build Tami locally

In the docker/ folder of this repository, there is a Dockerfile that builds the image and installs the requirements for TAMI, and two scripts (build_image.sh and run_container.sh) to handle the Docker operations.

Scripts Usage:

build_image.sh [--quantum]

run_container.sh [--no-gpu] [--quantum]

Default execution:

docker/build_image.sh
docker/run_container.sh
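
For example, to build the quantum-ready image or to run the container without GPU support, the optional flags listed above can be used as follows (illustrative invocations):

docker/build_image.sh --quantum
docker/run_container.sh --no-gpu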

External tools required for vectorization:

GIST DESCRIPTOR

The install.sh script should take care of integrating the GIST descriptor tool. If something fails, install the repository manually:

git clone https://github.com/tuttieee/lear-gist-python

Run in a Virtualenv

Tested on Ubuntu 20.04

You can run the script install.sh to set up all the necessary dependencies (excluding the GPU ones). Then, install the required libraries with pip:

pip install -r requirements/partial_requirements.txt 
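
For reference, a full setup from a clean checkout might look like the following (the environment name venv and the use of ./install.sh are illustrative assumptions):

python3 -m venv venv
source venv/bin/activate
./install.sh
pip install -r requirements/partial_requirements.txt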

Usage

Three scripts handle Tami executions: train_test.py (the main one), pre_processing.py, and post_processing.py. There are more utility scripts in the scripts folder, for example for backing up data and cleaning up old results/logs. Also, the script main_literature.py (which is based on train_test.py) allows training and testing a specific set of DL models from the literature (for 'related works' comparisons).

Train and test models

The script train_test.py allows training different DL models on a provided dataset. It also allows performing model assessment (by tuning the hyperparameters), saving and loading trained models, and outputting graphs and results of the training phase.

See further information on the arguments required with:

python train_test.py --help
usage: python train_test.py [-h] -m {DATA,LE_NET,ALEX_NET,STANDARD_CNN,STANDARD_MLP,CUSTOM_CNN,VGG16,VGG19,Inception,ResNet50,MobileNet,DenseNet,EfficientNet,QCNN} -d DATASET
                            [-o OUTPUT_MODEL] [-l LOAD_MODEL] [-t {hyperband,random,bayesian}] [-e EPOCHS] [-b BATCH_SIZE] [-i IMAGE_SIZE] [-w WEIGHTS] [-r LEARNING_RATE]
                            [--mode {train,train-val,train-test,test,gradcam-only}] [-v] [--exclude_top] [--no-caching] [--no-classes]

Tool for Analyzing Malware represented as Images

optional arguments:
  -h, --help            show this help message and exit

Arguments:
  -m {DATA,LE_NET,ALEX_NET,STANDARD_CNN,STANDARD_MLP,CUSTOM_CNN,VGG16,VGG19,Inception,ResNet50,MobileNet,DenseNet,EfficientNet,QCNN}, --model {DATA,LE_NET,ALEX_NET,STANDARD_CNN,STANDARD_MLP,CUSTOM_CNN,VGG16,VGG19,Inception,ResNet50,MobileNet,DenseNet,EfficientNet,QCNN}
                        Choose the model to use between the ones implemented
  -d DATASET, --dataset DATASET
                        the dataset path, must have the folder structure: training/train, training/val and test,in each of this folders, one folder per class (see dataset_test)
  -o OUTPUT_MODEL, --output_model OUTPUT_MODEL
                        Name of model to store
  -l LOAD_MODEL, --load_model LOAD_MODEL
                        Name of model to load
  -t {hyperband,random,bayesian}, --tuning {hyperband,random,bayesian}
                        Run Keras Tuner for tuning hyperparameters, options: [hyperband, random, bayesian]
  -e EPOCHS, --epochs EPOCHS
                        number of epochs
  -b BATCH_SIZE, --batch_size BATCH_SIZE
  -i IMAGE_SIZE, --image_size IMAGE_SIZE
                        FORMAT ACCEPTED = SxC , the Size (SIZExSIZE) and channel of the images in input (reshape will be applied)
  -w WEIGHTS, --weights WEIGHTS
                        If you do not want random initialization of the model weights (ex. 'imagenet' or path to weights to be loaded), not available for all models!
  -r LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning rate for training models
  --mode {train,train-val,train-test,test,gradcam-only}
                        Choose which mode run between 'train-val' (default), 'train-test', 'train', 'test'.The 'train-val' mode will run a phase of training and validation on the
                        training and validation set, the 'train-test' mode will run a phase of training on the training+validation sets and then test on the test set, the 'train'
                        mode will run only a phase of training on the training+validation sets, the 'test' mode will run only a phase of test on the test set. The 'gradcam' has been
                        moved to 'post_processing.py'
  -v, --version         show program's version number and exit
  --exclude_top         Exclude the fully-connected layer at the top of the network (default INCLUDE)
  --no-caching          Caching dataset on file and loading per batches (IF db too big for memory)
  --no-classes          In case of mode including test, skip results for each class (only cumulative results)
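
As an illustration, a training-and-validation run might look like the following (the model, epochs, batch size, and image size are placeholder values; the dataset path refers to the example dataset shipped with the repository, see DATASET/dataset_test_malware below):

python train_test.py -m VGG16 -d DATASET/dataset_test_malware -e 10 -b 32 -i 250x1 --mode train-val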

Logs, figures, and performance results are stored in the results and tuning folders. TensorBoard can be used to plot graphs of the training and validation trends.

tensorboard --logdir results/tensorboard/fit/

Prepare the Dataset

The script pre_processing.py provides two functionalities:

  1. split a set of files into training, validation, and test sets
  2. convert a dataset of files into a dataset of images.

See further information on the arguments required with:

python pre_processing.py --help
usage: python pre_processing.py [-h] -d DATASET [--mode {rgb-gray,rgb,gray,ds}] [--input {generic,apk}] [-p PERCENTAGE] [-v]

Tool for Analyzing Malware represented as Images

optional arguments:
  -h, --help            show this help message and exit

Arguments:
  -d DATASET, --dataset DATASET
                        the dataset path
  --mode {rgb-gray,rgb,gray,ds}
                        Choose which mode run between 'rgb-gray' (default), 'rgb', 'gray', and 'ds'.The 'rgb-gray' will convert the dataset in both grayscale and rgb colours, while
                        the other two modes ('rgb' and 'gray') only in rgb colours and grayscale, respectively.
  --input {generic,apk}
                        Custom image conversion or file split for some file format: -> generic: default, convert/handle the plain file -> apk: extract and convert/handle only the
                        .dex file
  -p PERCENTAGE, --percentage PERCENTAGE
                        Percentage for training, validation, and test set when --mode=ds. FORMAT ACCEPTED = X-Y-Z , which represent the training (X), validation (Y) and test (Z)
                        percentage, respectively. DEFAULT value is 80-10-10
  -v, --version         show program's version number and exit

Split dataset

The --mode=ds option splits the input dataset into training, validation, and test sets. The percentage of each set can be specified with -p <TRAINPERC>-<VALPERC>-<TESTPERC>. The input dataset MUST be in the following folder tree structure:

└─ /YOUR_DATASET 	
    ├─ /OUTPUT_CLASS_1
    |   ├─ YOUR_FILE_1
    |   ├─ ...
    |   └─ YOUR_FILE_N
    ├─ ...
    └─ /OUTPUT_CLASS_M
        ├─ YOUR_FILE_1
        ├─ ...
        └─ YOUR_FILE_K

The execution outputs the dataset in the folder tree structure required by the other Tami functionalities (such as the input folder tree structure required by --mode=[rgb-gray|rgb|gray], see the next subsection).

Example:

python pre_processing.py -d <RAW_DATASET> --mode=ds --percentage=80-10-10

Convert dataset

The --mode=[rgb-gray|rgb|gray] option converts a dataset of files into a dataset of images. The original files are cast to .png and converted into RGB or grayscale pictures (or both). The input must be a dataset of files already split in a folder tree structure such as the following (HINT: it is the output of --mode=ds):

└─ /YOUR_DATASET 	
    ├─ /training
    |   ├─ /train
    |   |   ├─ /OUTPUT_CLASS_1
    |   |   |   ├─ YOUR_FILE_1
    |   |   |   ├─ ...
    |   |   |   └─ YOUR_FILE_N
    |   |   ├─ ...
    |   |   └─ /OUTPUT_CLASS_M
    |   |       ├─ YOUR_FILE_1
    |   |       ├─ ...
    |   |       └─ YOUR_FILE_K
    |   └─ /val
    |       ├─ /OUTPUT_CLASS_1
    |       ├─ ...
    |       └─ /OUTPUT_CLASS_M
    └─ /test
        ├─ /OUTPUT_CLASS_1
        ├─ ...
        └─ /OUTPUT_CLASS_M

The number of samples/files in each OUTPUT_CLASS may differ, but the OUTPUT_CLASS folders MUST be the same (and consistently named) across all the dataset folders (training/train, training/val, and test). See the folder tree structure of DATASET/dataset_test_malware as an example.
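
For example, to convert a previously split dataset into grayscale images only (the <SPLIT_DATASET> placeholder follows the same convention used above):

python pre_processing.py -d <SPLIT_DATASET> --mode=gray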

Analyze the results

The script post_processing.py performs operations and analyses over the result sets of a previous training phase. The script has 4 main functionalities:

  • Gradcam: applies Grad-CAM to a loaded (and trained) model. Many 'cam' variants are available.
  • IF/IM-SSIM: runs the IF/IM-SSIM analysis on the heatmaps generated by the Gradcam(s).
  • DexWave: performs an attack on a trained DL model for image-based malware classification. It applies perturbations to the input samples to 'trick' the model and produce misclassifications, lowering the model accuracy and generating malware variants.
  • Plot-Generator: creates plots using the metrics stored in the .results file obtained after a training session.

See further information on the arguments required with:

python post_processing.py --help
usage: python post_processing.py [-h] [-l LOAD_MODEL] [-d DATASET] [-gl SAMPLE_GRADCAM] [-gs SHAPE_GRADCAM] [-sf [SSIM_FOLDERS [SSIM_FOLDERS ...]]] [-tg TARGET_CLASS]
                                 [--mode {IFIM-SSIM,DexWave,cam-gradcam_st1,cam-gradcam_st2,cam-gradcam++,cam-scorecam,cam-scorecam_fast,cam-gradcam_st2-guided,cam-gradcam++-guided,cam-scorecam-guided,cam-scorecam_fast-guided,gradcam-cati}]
                                 [-v] [--include_all]

Tool for Analyzing Malware represented as Images

optional arguments:
  -h, --help            show this help message and exit

Arguments:
  -l LOAD_MODEL, --load_model LOAD_MODEL
                        Name of model to load
  -lf LOAD_FILE, --load_file LOAD_FILE
                        Name of result's file to load
  -d DATASET, --dataset DATASET
                        the dataset path, must have the folder structure: training/train, training/val and test,in each of this folders, one folder per class (see dataset_test)
  -gl SAMPLE_GRADCAM, --sample_gradcam SAMPLE_GRADCAM
                        Limit gradcam to X samples randomly extracted from the test set
  -gs SHAPE_GRADCAM, --shape_gradcam SHAPE_GRADCAM
                        Select gradcam target layer with at least shapeXshape (for comparing different models)
  -sf [SSIM_FOLDERS [SSIM_FOLDERS ...]], --ssim_folders [SSIM_FOLDERS [SSIM_FOLDERS ...]]
                        List of gradcam results folder to compare with IF-SSIM and IM-SSIM
  -tg TARGET_CLASS, --target_class TARGET_CLASS
                        Target class for attack model with DexWave and try to produce missclassifications
  --mode {IFIM-SSIM,DexWave,cam-gradcam_st1,cam-gradcam_st2,cam-gradcam++,cam-scorecam,cam-scorecam_fast,cam-gradcam_st2-guided,cam-gradcam++-guided,cam-scorecam-guided,cam-scorecam_fast-guided,gradcam-cati,plot-generator}
                        Choose which mode run between 'cam-*' (many option available, cam-gradcam-st1 default), 'gradcam-cati',
                        and 'IFIM-SSIM'. See all options available with --help.
  --type {accuracy,loss,both}
                        Include all possible plots that can be created.
  -v, --version         show program's version number and exit
  --include_all         Include all possible heatmaps in the IFIM-SSIM analysis (default, choose a random subset)
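
As an illustration, running the Grad-CAM++ variant on 10 random samples from the test set of a previously trained model might look like this (the model name and dataset path are placeholders):

python post_processing.py -l <MODEL_NAME> -d <DATASET> --mode=cam-gradcam++ -gl 10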

Authors & References

  • Giacomo Iadarola - main contributor - Djack1010 giacomo.iadarola(at)iit.cnr.it
  • Christian Peluso - cati tool - 1Stohk1
  • Francesco Mercaldo - contributor - FrancescoMercaldo
  • Fabrizio Ravelli - contributor - reFraw

If you are using this repository, please cite our work by referring to our publications (BibTeX format):

@inproceedings{iadarola2021semi,
  title={A Semi-Automated Explainability-Driven Approach for Malware Analysis through Deep Learning},
  author={Iadarola, Giacomo and Casolare, Rosangela and Martinelli, Fabio and Mercaldo, Francesco and Peluso, Christian and Santone, Antonella},
  booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},
  pages={1--8},
  year={2021},
  organization={IEEE}
}

@inproceedings{gerardi2021perturbation,
  title={Perturbation of Image-based Malware Detection with Smali level morphing techniques},
  author={Gerardi, Federico and Iadarola, Giacomo and Martinelli, Fabio and Santone, Antonella and Mercaldo, Francesco},
  booktitle={2021 IEEE Intl Conf on Parallel \& Distributed Processing with Applications, Big Data \& Cloud Computing, Sustainable Computing \& Communications, Social Computing \& Networking (ISPA/BDCloud/SocialCom/SustainCom)},
  pages={1651--1656},
  year={2021},
  organization={IEEE}
}

@article{iadarola2021towards,
  title={Towards an Interpretable Deep Learning Model for Mobile Malware Detection and Family Identification},
  author={Iadarola, Giacomo and Martinelli, Fabio and Mercaldo, Francesco and Santone, Antonella},
  journal={Computers \& Security},
  pages={102198},
  year={2021},
  publisher={Elsevier}
}

Sub-repositories

List of other repositories related to this one, each created specifically for a project/work/paper and containing only the necessary subset of files.

Acknowledgements

The authors would like to thank the 'Trust, Security and Privacy' research group within the Institute of Informatics and Telematics (CNR - Pisa, Italy), which supports their research.

The Grad-CAM code is based on the work of Adrian Rosebrock, available here.