2021-Malware-Detection-Classification

Description

Malware detection helps protect computer systems from infection. The goal of any project, program, or system that detects malware is to prevent malicious software from running on a user’s computer. Our project contributes to this effort with a model that labels programs as either malware or benign software. The model is a Deep Neural Network (DNN).

The model is built from a repeated block: a dense layer with ReLU activation, followed by a batch normalization layer and a dropout layer. As the architecture diagram shows, we stack 10 of these blocks. This project takes inspiration from the paper “Malware Analysis with Artificial Intelligence and a Particular Attention on Results Interpretability” by Benjamin Marais, Tony Quertier, and Christophe Chesneau.
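
The block structure described above can be sketched in Keras as follows. This is only an illustration, not the code in src/ModelClass.py: the function name `build_model_sketch`, the hidden width, the dropout rate, the input size (2381, the size of an EMBER 2018 feature vector), and the two-unit output layer are all assumptions.

```python
import tensorflow as tf

NUM_BLOCKS = 10      # number of repeated blocks, as described in this README
INPUT_DIM = 2381     # assumption: EMBER 2018 (feature version 2) vector size
HIDDEN_UNITS = 128   # assumption: illustrative layer width
DROPOUT_RATE = 0.2   # assumption: illustrative dropout rate
NUM_CLASSES = 2      # benign vs. malware


def build_model_sketch():
    """Stack NUM_BLOCKS blocks of Dense(ReLU) -> BatchNormalization -> Dropout."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(INPUT_DIM,)))
    for _ in range(NUM_BLOCKS):
        model.add(tf.keras.layers.Dense(HIDDEN_UNITS, activation="relu"))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.Dropout(DROPOUT_RATE))
    # Assumption: a final layer that emits one raw score (logit) per class
    model.add(tf.keras.layers.Dense(NUM_CLASSES))
    return model
```

Calling `build_model_sketch().summary()` prints the resulting layer stack, which makes it easy to compare against the diagram.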

YouTube Video

https://youtu.be/BqHsMWOVJyg

Colab Notebook

https://colab.research.google.com/drive/134uvYzJ9QGpv0qjxN85GuS5xXco3R2zv?usp=sharing

Directory Guide

├── src
│   ├── ModelClass.py
│   ├── feature_vectorization.py
│   ├── features.py
│   ├── train.py
├── test
│   ├── environment_test.py
│   ├── sanity_test.py
├── .github
│   ├── workflows
│       ├── run_all_tests.yaml
├── requirements.txt
├── test-requirements.txt
├── README.md
└── .gitignore

src/ModelClass.py: defines the sequential model used by the training script
src/feature_vectorization.py: creates feature vectors (essentially arrays) for all files in the dataset, containing the id, hash, date, label, class, and subset
src/features.py: contains the classes that help sort out the features within files in the dataset
src/train.py: training script for the model; takes X_train, y_train, X_test, and y_test as input
test/environment_test.py: checks that the correct environment is installed
test/sanity_test.py: checks that nothing is broken
.github/workflows/run_all_tests.yaml: workflow that runs all tests on each commit to check that it is correct
requirements.txt: lists the packages required by the repository
test-requirements.txt: lists the packages required for testing
README.md: overview of the repository
.gitignore: lists the files and patterns that Git should ignore

Installing EMBER

pip install git+https://github.com/elastic/ember.git

Environment Installer Instructions

pip install -r requirements.txt
pip install -r test-requirements.txt

Dataset Downloader/Checker Instructions

pip install opendatasets

import tarfile

import opendatasets as od

# Download the EMBER 2018 archive into the current directory
od.download("https://ember.elastic.co/ember_dataset_2018_2.tar.bz2")

# Extract the archive; the context manager closes the file when done
with tarfile.open("./ember_dataset_2018_2.tar.bz2", "r:bz2") as tar:
    tar.extractall()
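
The steps above download and extract the archive but do not check the result. A minimal checker could look like the sketch below. The helper name `missing_dataset_files` is hypothetical, and the file list is an assumption based on the usual layout of the extracted EMBER 2018 archive (an `ember2018` directory holding six training files and one test file).

```python
from pathlib import Path

# Assumption: raw feature files expected in the extracted EMBER 2018 directory
EXPECTED_FILES = [f"train_features_{i}.jsonl" for i in range(6)] + [
    "test_features.jsonl"
]


def missing_dataset_files(data_dir):
    """Return the expected raw feature files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files("./ember2018")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected dataset files are present.")
```

An empty list means every expected file was found, so the function can also be used as a precondition in a test or training script.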

Training Instructions (To Start Training Process & Get Trained Weights)

import matplotlib.pyplot as plt
import tensorflow as tf

# build_model, vectorization, loss, and grad are this project's own helpers (see src/)
model = build_model()

# Replace this path with the location of your extracted ember2018 directory
X_train, y_train, X_test, y_test, comldf = vectorization(
    'C:\\Users\\amant\\Documents\\Anaconda_Envs\\coml_final\\ember2018\\')

# Assumption: batch the training arrays with tf.data, using batches of 32
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)

# Take one batch to check the loss and a single optimization step
features, labels = next(iter(train_dataset))

l = loss(model, features, labels, training=False)
print("Loss test: {}".format(l))

# Set up optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# Calculate a single optimization step
loss_value, grads = grad(model, features, labels)

print("Step: {}, Initial Loss: {}".format(
    optimizer.iterations.numpy(), loss_value.numpy()))

optimizer.apply_gradients(zip(grads, model.trainable_variables))

print("Step: {}, Loss: {}".format(optimizer.iterations.numpy(),
      loss(model, features, labels, training=True).numpy()))

# Train Model

# Note: Rerunning this cell uses the same model variables

# Keep results for plotting
train_loss_results = []
train_accuracy_results = []

num_epochs = 1

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

    # Training loop - using batches of 32
    for x, y in train_dataset:
        # Optimize the model on the current batch
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(y, model(x, training=True))

    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())

    if epoch % 50 == 0:
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(
            epoch, epoch_loss_avg.result(), epoch_accuracy.result()))

# Plot the loss and accuracy recorded for each epoch
fig, axes = plt.subplots(2, sharex=True, figsize=(12, 8))
fig.suptitle('Training Metrics')

axes[0].set_ylabel("Loss", fontsize=14)
axes[0].plot(train_loss_results)

axes[1].set_ylabel("Accuracy", fontsize=14)
axes[1].set_xlabel("Epoch", fontsize=14)
axes[1].plot(train_accuracy_results)
plt.show()
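
The training code calls `loss` and `grad`, which are not shown in this README. The sketch below shows one common way to write them for a custom TensorFlow training loop. It is an assumption about their shape, not the code from src/: in particular, it assumes the model outputs one raw logit per class and that labels are integer class ids.

```python
import tensorflow as tf

# Assumption: integer labels and a model that outputs raw logits
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


def loss(model, x, y, training):
    """Cross-entropy between the labels y and the model's predictions for x."""
    y_pred = model(x, training=training)
    return loss_object(y_true=y, y_pred=y_pred)


def grad(model, inputs, targets):
    """Return the loss and its gradients w.r.t. the model's trainable variables."""
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets, training=True)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
```

Recording the forward pass inside `tf.GradientTape` is what lets `tape.gradient` compute the gradients that `optimizer.apply_gradients` then applies.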

Model Instructions (To Test & Get Predicted Results)

The test script is incomplete. Once finished, it would use our trained model weights to predict whether a piece of software is malicious or benign and label it accordingly.
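
As a rough illustration of the labeling step, the sketch below maps model outputs to class names. The helper `label_predictions` is hypothetical and not part of this repository. It assumes the model emits two scores per sample and follows the EMBER label convention of 0 for benign and 1 for malicious.

```python
import numpy as np

# EMBER label convention: 0 = benign, 1 = malicious
CLASS_NAMES = ("benign", "malware")


def label_predictions(scores):
    """Map an (n_samples, 2) array of model outputs to class names."""
    scores = np.asarray(scores)
    class_ids = np.argmax(scores, axis=1)  # index of the highest score per row
    return [CLASS_NAMES[i] for i in class_ids]
```

With a trained Keras model, the scores would typically come from `model.predict(X_test)` after restoring the saved weights with `model.load_weights(...)`.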

Citations

- B. Marais, T. Quertier, and C. Chesneau, “Malware analysis with Artificial Intelligence and a particular attention on results interpretability,” Distributed Computing and Artificial Intelligence, Volume 1: 18th International Conference, pp. 43–55, 2021.
- H. S. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,” arXiv, 2018.
