Introduction

This project compares the performance of a hybrid machine learning model (LightGBM + XGBoost + Logistic Regression) with that of a single LightGBM model in detecting malware in Portable Executable (PE) format, and builds an application based on both models. The results show that the hybrid model performs better than LightGBM alone, so the malware detection application in this project is based on the hybrid model. For the full explanation of this project, see the paper listed in the Citing section. (NOTE: This is not a fully working application at the moment. It was built in a very short time, and its main purpose is to visualize the performance results of the two approaches.)
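As a rough illustration of what combining three classifiers can look like, here is a minimal soft-voting sketch in plain Python. Soft voting is an assumption made for this example: this README does not state how the hybrid model combines LightGBM, XGBoost, and Logistic Regression, so see the paper for the actual method. The function name `soft_vote` and the probability values are hypothetical.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    """Combine per-model malware probabilities by (weighted) averaging.

    probabilities: one list per model, each holding P(malicious) per sample.
    Returns one label per sample: 1 (malicious) or 0 (benign).
    NOTE: soft voting is an assumed combination rule, not necessarily
    the one used in this project.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    n_samples = len(probabilities[0])

    labels = []
    for i in range(n_samples):
        # Weighted mean of the models' probabilities for sample i
        mean_prob = sum(w * p[i] for w, p in zip(weights, probabilities)) / total_weight
        labels.append(1 if mean_prob >= threshold else 0)
    return labels


# Made-up probabilities for three samples, for illustration only
lgbm_probs = [0.9, 0.2, 0.6]
xgb_probs = [0.8, 0.1, 0.4]
lr_probs = [0.7, 0.4, 0.3]

predictions = soft_vote([lgbm_probs, xgb_probs, lr_probs])
```

Giving one model a larger weight (for example `weights=[4, 1, 1]`) lets it dominate borderline samples.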

Dataset

Two datasets were considered for this research: Bodmas and Ember. After weighing the pros and cons of each for this project, I chose the Bodmas dataset, which contains 57,293 malware and 77,142 benign Windows PE files. For full details on both datasets, see their respective GitHub repositories.

Installation

This section explains how to install this repository on your computer.

  1. Before you start, make sure you have at least 5 GB of free storage so that you can download this repository without problems. We also recommend setting up your Python environment with Python 3.6.8 (other versions of Python 3.6 or above might also work, but they have not been tested).

  2. Download this repository by clicking Code -> Download ZIP or Code -> GitHub Desktop, or by running this command in Git Bash:

git clone https://github.com/fauzan923/Malware-Detection-on-PE-File-using-Hybrid-Machine-Learning.git

  3. Install the required package dependencies:

pip install -r requirements.txt
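
The prerequisites above (about 5 GB of free storage and Python 3.6 or newer) can be checked with a short script. This is a minimal sketch using only the standard library; the function name `check_environment` and its parameters are hypothetical and not part of this repository.

```python
import shutil
import sys


def check_environment(path=".", min_free_gb=5, min_python=(3, 6)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Free space on the drive that contains `path`, in gigabytes
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    if free_gb < min_free_gb:
        problems.append(
            "Only {:.1f} GB free, {} GB recommended".format(free_gb, min_free_gb)
        )

    # Compare (major, minor) of the running interpreter with the minimum
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            "Python {}.{} found, {}.{} or newer recommended".format(
                sys.version_info[0], sys.version_info[1], min_python[0], min_python[1]
            )
        )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running it from the folder where you plan to clone the repository prints one warning per failed check and nothing if both pass.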

Configuration

To run the program, open the Main.ipynb file in Jupyter Notebook and click Run All. To use a different dataset, change the dataset-loading code inside Main.ipynb to match the dataset's file format, for example:

.csv File Extension

To use a dataset with a .csv file extension, change this code

filename = 'Bodmas/bodmas.npz'
data = np.load(filename)
X = data['X']  # all the feature vectors
y = data['y']  # labels, 0 as benign, 1 as malicious

Into this code

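import pandas as pd
filename = 'csv_file.csv'
data = pd.read_csv(filename)
y = data['y'].to_numpy()  # labels, 0 as benign, 1 as malicious (assumes the label column is named 'y')
X = data.drop(columns=['y']).to_numpy()  # all the feature vectors (assumes every other column is a feature)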
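
Before pointing the notebook at a CSV file, it can help to confirm that the file has the layout the code expects. The sketch below uses only the standard library and assumes a header row, numeric feature columns, and a label column named `y` holding 0 (benign) or 1 (malicious). That layout and the helper name `load_csv_dataset` are assumptions for illustration, not something this repository defines.

```python
import csv


def load_csv_dataset(path, label_column="y"):
    """Read a CSV into (features, labels) and validate the labels.

    Assumed layout: a header row, one numeric column per feature, and a
    label column (default 'y') with 0 = benign and 1 = malicious.
    """
    features, labels = [], []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or label_column not in reader.fieldnames:
            raise ValueError("Missing label column: " + label_column)
        feature_columns = [name for name in reader.fieldnames if name != label_column]

        for row in reader:
            # Accept labels written as '1' or '1.0'
            label = int(float(row[label_column]))
            if label not in (0, 1):
                raise ValueError("Unexpected label value: " + row[label_column])
            labels.append(label)
            features.append([float(row[name]) for name in feature_columns])
    return features, labels
```

The returned lists can be converted with `np.asarray(features)` and `np.asarray(labels)` if the rest of the notebook expects NumPy arrays.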

Experiment Results

Figure: Confusion Matrix LightGBM
Figure: Confusion Matrix Hybrid
Figure: ROC Plot

The results show that the proposed hybrid machine learning model detects malware with higher recall than any of the base models used to build it. Its other metrics are equal to or lower than those of the base models and of the compared algorithm (LightGBM), but recall is the main focus of this research. The proposed hybrid model reached a recall of 99.5026%, the highest among the models compared: the base models LightGBM, XGBoost, and Logistic Regression reached 99.4480%, 99.5004%, and 98.0539%, respectively.

Although the proposed hybrid model performed better than LightGBM, improvements can still be made. Future research could use larger and more recent datasets, try other algorithms in the hybrid, combine more than three algorithms, or use other machine learning methods that may produce higher accuracy than the hybrid model proposed in this research.
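
Since recall is the main metric here, it may help to see how it is derived from a confusion matrix: recall is the share of malicious files that the model actually flags, TP / (TP + FN). The sketch below computes recall along with accuracy and precision from the four confusion-matrix counts. The function name is hypothetical and the counts in the example are made up for illustration; they are not the results of this research.

```python
def metrics_from_confusion(true_negative, false_positive, false_negative, true_positive):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    Malicious is treated as the positive class (label 1).
    """
    total = true_negative + false_positive + false_negative + true_positive
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    if total == 0 or predicted_positive == 0 or actual_positive == 0:
        raise ValueError("Counts must include at least one actual and one predicted positive")

    return {
        # Share of all files classified correctly
        "accuracy": (true_positive + true_negative) / total,
        # Share of flagged files that really are malicious
        "precision": true_positive / predicted_positive,
        # Share of malicious files that were flagged
        "recall": true_positive / actual_positive,
    }


# Made-up counts, for illustration only
example = metrics_from_confusion(
    true_negative=980, false_positive=20, false_negative=5, true_positive=995
)
```

A missed malware sample (a false negative) lowers recall but not precision, which is why a detector that prioritizes catching malware is judged mainly on recall.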

Authors

Citing

If you use this repository in a publication, please cite the following paper:

@inproceedings{ramadhan2021analysis,
  title={Analysis Study of Malware Classification Portable Executable Using Hybrid Machine Learning},
  author={Ramadhan, Fauzan Hikmah and Suryani, Vera and Mandala, Satria},
  booktitle={2021 International Conference on Intelligent Cybernetics Technology \& Applications (ICICyTA)},
  pages={86--91},
  year={2021},
  organization={IEEE}
}