/malware_detect2

Malware Classification using Machine learning

Primary LanguagePython

Malware Classification using classical Machine Learning and Deep Learning

This repository is the official implementation of the research mentioned in the chapter "An Empirical Analysis of Image-Based Learning Techniques for Malware Classification" of the Book "Malware Analysis Using Artificial Intelligence and Deep Learning"

The book or chapters can be purchased from: https://link.springer.com/chapter/10.1007/978-3-030-62582-5_16

The arXiv eprint is at: https://arxiv.org/abs/2103.13827

alt text

Abstract

In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role—specifically, we test the VGG-19 and ResNet152 models. As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published.

Quick Notes:

  • Classic ML-based approaches tried : K-NN, Random Forest, and XGBoost
  • Deep Learning-based approaches tried: ANN, CNN, LSTM, and GRU
  • Implementation is using sklearn, numpy, pandas and pytorch.
  • MS Windows executable binary files are used as data.
  • Features   * Classic ML-based approaches: PE fie features are extracted and used   * Deep Learning-based approaches: (1) Opcodes (2) Converted executables into gray-scale images
  • This project is an extension of https://github.com/pratikpv/malware_classification

Steps to repro

Packages requirements

  • Install pefile pythong package e.g. conda install pefile
  • Install PyTorch and other libs e.g. conda install -c pytorch torchtext. All other common dependencies should be covered by anaconda distro.
  • objdump in ubuntu. (This code is developed and tested for ubuntu-based development env)

Malware samples

 * copy the malware samples at <project_dir>/data/exec_files/exec_files. You can reach out to me for samples used in this research. Overall directory structure should look like this,

├── config.py
├── data
│             ├── exec_files
│             │             └── exec_files
│             │                 ├── adload
│             │                 ├── agent
│             │                 ├── alureon
│             │                 ├── bho
│             │                 ├── ceeinject
│             │                 ├── cycbot
│             │                 ├── delfinject
│             │                 └── fakerean
├── data_preprocess.py
├── data_utils
.
.

Data preprocessing

Execute data_preprocess.py with below mentioned options to preprocess the data.

python data_preprocess.py --extract_pe_features

python data_preprocess.py --bin_to_img

python data_preprocess.py --extract_opcodes

python data_preprocess.py --split_opcodes

Train and test models

Execute detect_malware.py with appropriate command-line args for models to train/test. e.g.

python detect_malware.py --deep_feedforward

python detect_malware.py --deep_rnn

python detect_malware.py --shallow_ml

python detect_malware.py --transfer_conv_ml

If you like our work and is useful for your research please cite this chapter/paper as:

Prajapati P., Stamp M. (2021) An Empirical Analysis of Image-Based Learning Techniques for Malware Classification. In: Stamp M., Alazab M., Shalaginov A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_16

or

@Inbook{
    Prajapati2021,
    author={Prajapati, Pratikkumar and Stamp, Mark},
    editor={Stamp, Mark and Alazab, Mamoun  and Shalaginov, Andrii},
    title={An Empirical Analysis of Image-Based Learning Techniques for Malware Classification},
    bookTitle={Malware Analysis Using Artificial Intelligence and Deep Learning},
    year={2021},
    publisher={Springer International Publishing},
    address={Cham},
    pages={411-435},
    abstract={In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role---specifically, we test the VGG-19 and ResNet152 models. As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published.},
    isbn={978-3-030-62582-5},
    doi={10.1007/978-3-030-62582-5_16},
    url={https://doi.org/10.1007/978-3-030-62582-5_16}
}