Malware Classification with Hidden Markov Models and Support Vector Machines

HMM ROC curve for winwebsec vs zbot	Stacked HMM and SVM ROC curve for winwebsec vs zbot

This project is an implementation of a Hidden Markov Model (HMM) in java to classify malware families. Stacked HMMs and used Support Vector Machine (SVM) as a metaclassifier to improve the results. Used AUC as the main metric.

HMM Introduction

HMMs are used to solve three main problems:

Problem 1: Given the model λ = (A, B, π) and a sequence of observations O, find P(O | λ). Here, we want to determine a score for the observed sequence O with respect to the given model λ.
Problem 2: Given λ = (A, B, π) and an observation sequence O, find an optimal state sequence for the underlying Markov process.
Problem 3: Given an observation sequence O and the dimensions N and M, find the model λ = (A, B, π) that maximizes the probability of O.

Dataset Description

The dataset Malicia contains 48 malware families. For this project, I only worked with winwebsec, zbot, and zeroAccess because those families had the majority of the samples. Each sample contained the extracted operational codes (opcodes) from the binaries of the malware sample.

Description

The opcodes were counted to determine which opcodes appeared most frequently in each of the malware families. The top M number of opcodes were used as a hyperparameter for the HMM. Each HMM was trained on 1200 samples and tested on 100 samples from the same family and 100 samples from the same different family. A total of six experiments were performed:

winwebsec vs zbot
winwebsec vs zeroAccess
zbot vs winwebsec
zbot vs zeroAccess
zeroAccess vs winwebsec
zeroAccess vs zbot

Due to computational limitations the observation sequence was limited to 70k. To determine the performance of the HMM, the area under the ROC curve was computed and displayed. To improve the classification results, three HMMs were stacked and used as an input as a feature vector into the SVM.

Results

Using only HMMs to classifiy the malware samples yielded between 50% - 94% accuracy. But when stacking three HMMs together and using a SVM as a metaclassifier the results were between 85% - 100% accuracy.

Project/File Structure

"Malicia" contains the data used for this project
"featureVectors" contains the scores of the three stacked HMMs for each experiment
"output" contains the saved HMMs, opcode counts, and testing results
"src" contains the java code used to implement the HMM, do the preprocessing, and run the experiments
"concat.ipynb" conatains the python code to concatenate the CSV files to create feature vectors for SVM
"hmm_results.ipynb" contains the plotted ROC curves for the trained HMMs
"stacking_with_svm_results.ipynb" contains the results from used the stacked HMMs as inputs into the SVM.

Dependencies

Java 11
Python 3.7.1
pandas
matplotlib
sklearn
numPy
os

Executing program

run "git clone https://github.com/BKM99/Malware-Classification-with-Hidden-Markov-Models.git"
Navigate to the project folder and run src/experiments/BestNM.java and src/experiments/FeatureVectors.java to recreate the experiments performed
run/view "hmm_results.ipynb" and "stacking_with_svm_results.ipynb" to see results

Authors

Brandon Morimoto, undergraduate at San Jose State University

Acknowledgments

HMM implementation based off of "A Revealing Introduction to Hidden Markov Models" by Mark Stamp, SJSU (http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf)

Future Work

Perform more rigorous experiments to find the best N and M for the HMMs. Try stacking more HMMs to feed into the SVM. Also, try other classification models as the metaclassifier.

gamja99/Malware-Classification-with-HMM-and-SVM