Malware Detection

The main goal of this project is to classify benign files and malware variants into their respective families. It consists of two parts, each taking a different approach to this goal: one uses an XGBoost model and the other uses a deep learning model.

XGBoost Model: A machine learning model built with XGBoost, a scalable and accurate implementation of gradient boosting machines.

Deep Learning Model: A classification model based on the convolutional network from the paper 'Malware Detection by Eating a Whole EXE' by Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro and Charles Nicholas (link: https://arxiv.org/pdf/1710.09435.pdf).

Results:

XGBoost Model:

The XGBoost model was applied to one class of benign files and 3 classes of malware taken from the 2015 Kaggle contest. Each file is a Windows 8 PE without the PE header. The results are:

Accuracy (%)	Train	Test
XGBoost	99.98	99.62

Deep Learning Model:

The deep learning model was used twice: once as a multiclass classifier for 3 classes and once as a binary classifier.

Accuracy (%)	Train	Test
Binary	99.95	97.18
Multiclass	95.32	90.11

Requirements:

XGBoost
PyTorch
pefile
Capstone
scikit-learn (sklearn)
NumPy

Install:

        git clone https://github.com/EdenShuker/Malware-Detection.git

Running Instructions:

Deep Learning Model:

This part uses the 'run.py' script in the directory 'deep_code'.

Run by:

  python3 run.py [options]

Options are:

[-train configuration_file]
[-save model_filename]
[-load model_filename]
[-eval configuration_file]
[-test configuration_file]

Arguments:

NOTE - the '.yaml' configuration files are explained after the argument descriptions.

Add '-train configuration_file' in order to train a model.
configuration_file is a path to a '.yaml' file containing the training configuration.

Add '-save model_filename' in order to save the model after training.
model_filename is the path of the file that will be created to hold the saved model.

Add '-load model_filename' in order to load a saved model.
model_filename has the same meaning as in the save option.

Add '-eval configuration_file' in order to evaluate your model on a dataset.
configuration_file is a path to a '.yaml' file containing the validation configuration.

Add '-test configuration_file' in order to run a blind prediction on files.
configuration_file is a path to a '.yaml' file containing the testing configuration.
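
As an illustration of how the five options fit together, here is a minimal argparse sketch. It is not the project's actual 'run.py', which may parse its arguments differently; the function name build_parser and the file names 'train.yaml' and 'model.pt' are made up for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the documented options;
    # the real run.py may be implemented differently.
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("-train", metavar="configuration_file",
                        help="'.yaml' file with the training configuration")
    parser.add_argument("-save", metavar="model_filename",
                        help="file to create for the trained model")
    parser.add_argument("-load", metavar="model_filename",
                        help="file holding a previously saved model")
    parser.add_argument("-eval", metavar="configuration_file",
                        help="'.yaml' file with the validation configuration")
    parser.add_argument("-test", metavar="configuration_file",
                        help="'.yaml' file with the testing configuration")
    return parser


# Example: train a model and save it afterwards (file names are made up).
args = build_parser().parse_args(["-train", "train.yaml", "-save", "model.pt"])
print(args.train, args.save, args.load)
```

Under this reading, typical combinations would be '-train' with '-save' to produce a model, and '-load' with '-eval' or '-test' to reuse one.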

Configuration Files:

Each line in a '.yaml' file is a key-value pair in the format 'key: value'.

Keys:

main_dir - (string) like 'some/path', path to the main directory. It contains one sub-directory per family, and each sub-directory holds the files of that family.
first_n_bytes - (int) number of bytes to read from each file in order to classify it.
lr - (float) learning rate.
num_epochs - (int) number of epochs for training.
labels - (string) path to a file in which each line is a label name.
labels2dir - (string) path to a '.csv' file in which each line has the format 'label,dir'. It maps a label name to the name of its sub-directory in the main directory.
batch - (int) batch size to use in the train and dev data-loaders.
workers - (int) number of workers to use in the train and dev data-loaders.
conf_mat - (boolean) True or False, set to True to show the confusion matrix in the last evaluation on dev.
files_ls_path - (string) path to a file in which each line is the path of a file to predict on.
target_file - (string) path to a file that will be created after the prediction. line[i] in it is the label of files[i] in the file that was passed in 'files_ls_path'.
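
To make the relationship between main_dir, labels, labels2dir and the byte limit concrete, here is a minimal sketch of how a dataset could be indexed from these keys. It is not the project's loader: the helper names, the demo file names and the zero-padding of short files are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def read_labels(labels_path):
    """Return the label names, one per non-empty line, in file order."""
    with open(labels_path) as f:
        return [line.strip() for line in f if line.strip()]


def read_labels2dir(csv_path):
    """Map each label name to its sub-directory name ('label,dir' rows)."""
    with open(csv_path, newline="") as f:
        return {row[0].strip(): row[1].strip()
                for row in csv.reader(f) if len(row) >= 2}


def index_files(main_dir, labels, label2dir):
    """Return (file_path, label_index) pairs for every file under main_dir."""
    samples = []
    for index, label in enumerate(labels):
        sub_dir = os.path.join(main_dir, label2dir[label])
        for name in sorted(os.listdir(sub_dir)):
            samples.append((os.path.join(sub_dir, name), index))
    return samples


def read_first_n_bytes(path, first_n_bytes):
    """Read at most first_n_bytes bytes as a list of ints.

    Zero-padding shorter files is an assumption, not taken from the project.
    """
    with open(path, "rb") as f:
        data = f.read(first_n_bytes)
    return list(data) + [0] * (first_n_bytes - len(data))


# Tiny demo on a throwaway directory tree (all names are made up).
with tempfile.TemporaryDirectory() as root:
    for sub_dir, payload in [("benign_files", b"MZ\x90"), ("family_a", b"\x01\x02")]:
        os.makedirs(os.path.join(root, sub_dir))
        with open(os.path.join(root, sub_dir, "sample.bin"), "wb") as f:
            f.write(payload)
    with open(os.path.join(root, "labels.txt"), "w") as f:
        f.write("benign\nfamily_a\n")
    with open(os.path.join(root, "labels2dir.csv"), "w") as f:
        f.write("benign,benign_files\nfamily_a,family_a\n")

    labels = read_labels(os.path.join(root, "labels.txt"))
    label2dir = read_labels2dir(os.path.join(root, "labels2dir.csv"))
    samples = index_files(root, labels, label2dir)
    demo_bytes = [read_first_n_bytes(path, 4) for path, _ in samples]
    demo_targets = [index for _, index in samples]

print(demo_targets)
print(demo_bytes)
```

In this sketch the order of the lines in the labels file determines each label's index, which is one plausible convention rather than a documented one.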

The keys needed for each '.yaml' file:

Train	Eval	Test
main_dir	main_dir	files_ls_path
first_n_bytes	labels	labels2dir
lr	labels2dir	workers
num_epochs	batch	first_n_bytes
labels	workers	target_file
labels2dir	first_n_bytes
batch	conf_mat
workers
conf_mat
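
The table above can be turned into a quick sanity check before a run. The sketch below is only an illustration: the tiny 'key: value' parser is a standard-library stand-in for a real YAML loader, the function names are hypothetical, and every value in the example configuration is made up.

```python
# Required keys per mode, copied from the table above.
REQUIRED_KEYS = {
    "train": ["main_dir", "first_n_bytes", "lr", "num_epochs", "labels",
              "labels2dir", "batch", "workers", "conf_mat"],
    "eval": ["main_dir", "labels", "labels2dir", "batch", "workers",
             "first_n_bytes", "conf_mat"],
    "test": ["files_ls_path", "labels2dir", "workers", "first_n_bytes",
             "target_file"],
}


def parse_config(text):
    """Parse flat 'key: value' lines into a dict of strings."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def missing_keys(config, mode):
    """Return the keys required for `mode` that are absent from `config`."""
    return [key for key in REQUIRED_KEYS[mode] if key not in config]


# Example training configuration; all values are illustrative only.
EXAMPLE_TRAIN = """\
main_dir: data/train
first_n_bytes: 2000000
lr: 0.001
num_epochs: 10
labels: data/labels.txt
labels2dir: data/labels2dir.csv
batch: 16
workers: 2
conf_mat: True
"""

config = parse_config(EXAMPLE_TRAIN)
print(missing_keys(config, "train"))
print(missing_keys(config, "test"))
```

A check like this reports missing keys up front instead of failing partway through training or prediction.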

XGBoost Model:

This part uses the 'run.py' script in the directory 'ml_code'. The script reads its parameters from the 'config_ml.yaml' file, which you can edit as needed.

Run by:

  python3 run.py

Configuration Files:

Each line in a '.yaml' file is a key-value pair in the format 'key: value'.

Keys:

num_classes - (int) number of classes you want to distinguish between.
filePath2label - (string) path to a csv file that maps each file path in the train set to its label.
dir_malware_files - (string) path to the directory that contains the malware files ('.bytes' and '.asm' format for each file).
dir_benign_dll - (string) path to the directory that contains the '.dll' files of the benign files.
dir_benign_bytes - (string) path to the directory that contains the '.bytes' files of the benign files.
show_matrix - (boolean) 'True' to print the confusion matrix, 'False' otherwise.
train - (boolean) 'True' to train the model from scratch, 'False' otherwise (note: if you choose 'False', you need to set the 'load' key to 'True').
save - (boolean) 'True' to save the model into a file, 'False' otherwise.
model_save_name - (string) the name of the saved model.
load - (boolean) 'True' if you want to load an already trained model, 'False' otherwise.
model_load_name - (string) the name (or path) of the model to load.
test - (boolean) 'True' to test the model, 'False' otherwise. If you set test to 'True', an output file 'test.output' will be created. Note: by default, testing runs on the dev set. To test the model on a new set, you need to modify the script slightly and pass the f2v.file of the new set as a parameter to 'model.py'.
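
The interaction between 'train', 'load' and 'save' can be checked before running the script. The sketch below is a hypothetical helper, not part of 'ml_code': only the first rule (no training implies loading) comes from the key descriptions above, and the two name checks are assumptions about what a sensible configuration needs.

```python
def to_bool(value):
    """Interpret 'True'/'true' (string or bool) as True, anything else as False."""
    return str(value).strip().lower() == "true"


def check_ml_config(config):
    """Return a list of human-readable problems found in a config_ml-style dict."""
    problems = []
    train = to_bool(config.get("train"))
    load = to_bool(config.get("load"))
    save = to_bool(config.get("save"))

    # Documented rule: without training, a saved model has to be loaded.
    if not train and not load:
        problems.append("train is False, so load must be True")
    # Assumption: loading needs the name (or path) of the model to load.
    if load and not config.get("model_load_name"):
        problems.append("load is True but model_load_name is missing")
    # Assumption: saving needs a name for the saved model.
    if save and not config.get("model_save_name"):
        problems.append("save is True but model_save_name is missing")
    return problems


# Example values are made up for illustration.
print(check_ml_config({"train": "False", "load": "False"}))
print(check_ml_config({"train": "True", "load": "False",
                       "save": "True", "model_save_name": "model.bin"}))
```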
