This project's main goal is to classify benign files and malware variants into their respective families. It consists of 2 parts, each taking a different approach to this goal: the first uses an XGBoost model and the second uses a deep learning model.
First model - XGBoost: A machine learning model using xgboost, a scalable and accurate implementation of gradient boosting machines.
Second model - Deep Learning: A classification model based on the convolutional network from the paper 'Malware Detection by Eating a Whole EXE' by Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro and Charles Nicholas (link: https://arxiv.org/pdf/1710.09435.pdf).
The XGBoost model was applied to one class of benign files and 3 classes of malware taken from the 2015 Kaggle contest (the Microsoft Malware Classification Challenge). Each file is a Windows 8 PE file without the PE header. These are the results:
Accuracy | Train | Test |
---|---|---|
XGBoost | 99.98 | 99.62 |
The deep learning model was used twice: once as a multiclass classifier for 3 classes and once as a binary classifier.
Accuracy | Train | Test |
---|---|---|
Binary | 99.95 | 97.18 |
Multiclass | 95.32 | 90.11 |
- XGBoost
- PyTorch
- pefile
- Capstone
- scikit-learn
- NumPy
git clone https://github.com/EdenShuker/Malware-Detection.git
The deep learning part uses the 'run.py' script in the 'deep_code' directory.
Run by:
python3 run.py [options]
Options are:
- [-train configuration_file]
- [-save model_filename]
- [-load model_filename]
- [-eval configuration_file]
- [-test configuration_file]
NOTE - the '.yaml' files (configuration files) are explained after the option descriptions.
- Add '-train configuration_file' to train a model.
  configuration_file is the path to a '.yaml' file containing the training configuration.
- Add '-save model_filename' to save the model after training.
  model_filename is the path of the file that will be created to hold the saved model.
- Add '-load model_filename' to load a saved model.
  model_filename has the same meaning as in the save option.
- Add '-eval configuration_file' to evaluate your model on a dataset.
  configuration_file is the path to a '.yaml' file containing the evaluation configuration.
- Add '-test configuration_file' to run a blind prediction on files.
  configuration_file is the path to a '.yaml' file containing the testing configuration.
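The five options above could be parsed roughly as in the sketch below. It is not taken from 'run.py': the function names are hypothetical, and the two consistency rules in `check_args` (a model must come from '-train' or '-load', and '-save' needs '-train') are assumptions based on the option descriptions.

```python
import argparse


def build_parser():
    """Build a parser for the five options (a sketch, not the script's code)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("-train", metavar="configuration_file",
                        help="path to the training '.yaml' file")
    parser.add_argument("-save", metavar="model_filename",
                        help="file that will hold the saved model")
    parser.add_argument("-load", metavar="model_filename",
                        help="file to read a saved model from")
    parser.add_argument("-eval", metavar="configuration_file",
                        help="path to the evaluation '.yaml' file")
    parser.add_argument("-test", metavar="configuration_file",
                        help="path to the testing '.yaml' file")
    return parser


def check_args(args):
    """Return a list of problems with the chosen options (assumed rules)."""
    problems = []
    if args.train is None and args.load is None:
        problems.append("no model: pass -train or -load")
    if args.save is not None and args.train is None:
        problems.append("-save only makes sense together with -train")
    return problems


if __name__ == "__main__":
    # Hypothetical file names, equivalent to:
    #   python3 run.py -train train.yaml -save model.pt
    args = build_parser().parse_args(["-train", "train.yaml", "-save", "model.pt"])
    print(args.train, args.save, check_args(args))
```

Under these assumptions, a typical session would train and save once, then call the script again with '-load' together with '-eval' or '-test'.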
Each line in a '.yaml' file is a key-value pair in the format 'key: value'.
Keys:
- main_dir - (string) path to the main directory, like 'some/path'. It contains one sub-directory per family, and each sub-directory contains the files of that family.
- first_n_byte - (int) number of bytes to read from each file in order to classify it.
- lr - (float) learning rate.
- num_epochs - (int) number of training epochs.
- labels - (string) path to a file in which each line is a label name.
- labels2dir - (string) path to a '.csv' file in which each line has the format 'label,dir'. It maps a label name to the name of its sub-directory in the main directory.
- batch - (int) batch size to use in the train and dev data-loaders.
- workers - (int) number of workers to use in the train and dev data-loaders.
- conf_mat - (boolean) True or False. Set to True to show the confusion matrix in the last evaluation on dev.
- files_ls_path - (string) path to a file in which each line is the path of a file to predict on.
- target_file - (string) path to a file that will be created after the prediction. Line i in it is the label of file i in the file passed as 'files_ls_path'.
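To make the relationship between 'main_dir', 'labels2dir' and the byte limit concrete, here is a minimal sketch that uses only the standard library. The function names are hypothetical and the project's own data-loading code may work differently. In particular, this sketch does not pad files shorter than the byte limit, which a real model input would probably need.

```python
import csv
import os


def read_labels2dir(csv_path):
    """Map each label name to its sub-directory name ('label,dir' per line)."""
    mapping = {}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # skip blank lines
                continue
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def list_labeled_files(main_dir, labels2dir):
    """Return (file_path, label) pairs for every file in main_dir/<sub_dir>."""
    pairs = []
    for label, sub_dir in sorted(labels2dir.items()):
        family_dir = os.path.join(main_dir, sub_dir)
        for name in sorted(os.listdir(family_dir)):
            pairs.append((os.path.join(family_dir, name), label))
    return pairs


def read_first_n_bytes(path, n):
    """Read at most n bytes from the start of a file."""
    with open(path, "rb") as f:
        return f.read(n)
```

Each pair returned by `list_labeled_files` can then be passed to `read_first_n_bytes` with the configured byte count to obtain the raw input for one sample.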
The keys needed for each '.yaml' file:
Train | Eval | Test |
---|---|---|
main_dir | main_dir | files_ls_path |
first_n_bytes | labels | labels2dir |
lr | labels2dir | workers |
num_epochs | batch | first_n_bytes |
labels | workers | target_file |
labels2dir | first_n_bytes | |
batch | conf_mat | |
workers | | |
conf_mat | | |
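The table above can be turned into a quick check that a configuration file has every key its mode needs. The sketch below is illustrative only: the function names, file names and values are hypothetical, and the small 'key: value' parser stands in for a real YAML library. It spells the byte-count key 'first_n_bytes' as in the table. The key list spells it 'first_n_byte', so check which spelling the script expects.

```python
# Required keys per mode, copied from the table above.
REQUIRED_KEYS = {
    "train": ["main_dir", "first_n_bytes", "lr", "num_epochs", "labels",
              "labels2dir", "batch", "workers", "conf_mat"],
    "eval": ["main_dir", "labels", "labels2dir", "batch", "workers",
             "first_n_bytes", "conf_mat"],
    "test": ["files_ls_path", "labels2dir", "workers", "first_n_bytes",
             "target_file"],
}

# Hypothetical contents of a testing configuration file.
EXAMPLE_TEST_CONFIG = """\
files_ls_path: data/files_to_predict.txt
labels2dir: data/labels2dir.csv
workers: 2
first_n_bytes: 2000000
target_file: predictions.txt
"""


def parse_flat_yaml(text):
    """Parse flat 'key: value' lines into a dict of strings."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        config[key.strip()] = value.strip()
    return config


def missing_keys(config, mode):
    """List the keys required for the given mode that the config lacks."""
    return [key for key in REQUIRED_KEYS[mode] if key not in config]


if __name__ == "__main__":
    config = parse_flat_yaml(EXAMPLE_TEST_CONFIG)
    print(missing_keys(config, "test"))
```

Running such a check before starting a long training run reports a misspelled or forgotten key immediately.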
The XGBoost part uses the 'run.py' script in the 'ml_code' directory. This script reads its parameters from the 'config_ml.yaml' file, which you can change as needed.
Run by:
python3 run.py
Each line in the '.yaml' file is a key-value pair in the format 'key: value'.
Keys:
- num_classes - (int) number of classes you want to distinguish between.
- filePath2label - (string) path to a csv file that maps each file path in the train set to its label.
- dir_malware_files - (string) path to the directory containing the malware files (in '.bytes' and '.asm' format for each file).
- dir_benign_dll - (string) path to the directory containing the '.dll' files of the benign files.
- dir_benign_bytes - (string) path to the directory containing the '.bytes' files of the benign files.
- show_matrix - (boolean) 'True' to print the confusion matrix, 'False' otherwise.
- train - (boolean) 'True' to train the model from scratch, 'False' otherwise (note: if you choose 'False', you need to set the 'load' key to 'True').
- save - (boolean) 'True' to save the model into a file, 'False' otherwise.
- model_save_name - (string) the name of the saved model.
- load - (boolean) 'True' to load an already trained model, 'False' otherwise.
- model_load_name - (string) the name (or path) of the model to load.
- test - (boolean) 'True' to test the model, 'False' otherwise. If you set 'test' to 'True', an output file 'test.output' will be created. Note: by default, testing is done over the dev set. To test the model over a new set, you need to modify the script slightly and pass the f2v.file of the new set as a parameter to 'model.py'.
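Several of these keys depend on each other, so a small consistency check can catch a bad combination before the script runs. The sketch below is an assumption-laden illustration and not part of 'ml_code': `check_ml_config` is a hypothetical helper that expects the configuration as a Python dict. Only the first rule (no training requires loading) comes from the key descriptions. The other two rules are assumed.

```python
def check_ml_config(config):
    """Return a list of problems found in a dict of 'config_ml.yaml' keys."""
    problems = []
    # Stated in the 'train' key description: without training, a model must be loaded.
    if not config.get("train") and not config.get("load"):
        problems.append("'train' is False, so 'load' must be True")
    # Assumed rule: loading needs a model name or path.
    if config.get("load") and not config.get("model_load_name"):
        problems.append("'load' is True but 'model_load_name' is empty")
    # Assumed rule: saving needs a target name.
    if config.get("save") and not config.get("model_save_name"):
        problems.append("'save' is True but 'model_save_name' is empty")
    return problems


if __name__ == "__main__":
    # Hypothetical values for a train-and-save run.
    example = {
        "train": True,
        "save": True,
        "model_save_name": "xgb.model",
        "load": False,
        "test": True,
    }
    print(check_ml_config(example))
```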