NTU Malware Reverse Final Project Notes

tags: NTU_MR Malware Reverse Engineering and Analysis

Deep learning at the shallow end: Malware classification for non-domain experts

How to reproduce?

  1. Construct the environment. The full setup procedure is described in 安裝 tensorflow 及 cuda cudnn 心得 (Notes on installing tensorflow with CUDA and cuDNN). Following the TensorFlow documentation, I chose the library versions shown below:

    Object               Version
    CUDA                 11.2
    cuDNN                8.1
    Python               3.6.13
    GPU Driver Version   526.98
    tensorflow           2.6.2
    tensorflow-gpu       2.6.0

    Then, following the NVIDIA cuDNN documentation, use the zlibwapi.dll provided by this page directly. This compressed folder is for x64 processors. Note: DO NOT USE this page or this page; those are for x86 processors.

  2. Problems that occurred during setup:

  3. Then revise the dataset path in the code and run it directly.
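
Before running the training code, it can help to confirm that the installed TensorFlow build actually sees the GPU. The following is a minimal sketch and not part of the original project: the helper name `describe_tf_environment` is hypothetical, and it assumes only that TensorFlow 2.x may be importable in the active environment (it reports `None` values otherwise).

```python
import importlib.util
import sys


def describe_tf_environment():
    """Collect basic facts about the Python / TensorFlow / GPU setup.

    Hypothetical helper for illustration; returns a plain dict.
    """
    info = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "tensorflow": None,
        "built_with_cuda": None,
        "gpus": None,
    }
    # Only import TensorFlow if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf

        info["tensorflow"] = tf.__version__
        info["built_with_cuda"] = tf.test.is_built_with_cuda()
        # An empty list here usually means the CUDA/cuDNN libraries were not found.
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in describe_tf_environment().items():
        print(f"{key}: {value}")
```

If `gpus` comes back as an empty list even though `built_with_cuda` is `True`, the CUDA, cuDNN, or zlibwapi.dll locations on `PATH` are a reasonable first thing to check.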

How to feed in the malimg dataset

Revise original code

  • Note that the kfold parameter cannot be set to less than 2.
  • You can skip or comment out the entire part that converts .bin files to images, and start from the Load image data from the training set section.
  • Note that you must keep the following variables and procedures:
    ...
    max_len = int(1e4)
    ...
    binfn_id2cls = {} # file name id is the part before .
    for fn_label_item in train_labels_df.itertuples():
        binfn_id2cls[fn_label_item.Id] = fn_label_item.Class
  • Furthermore, you must update the data paths:
    project_dir = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset'
    data_dir = os.path.join(project_dir, 'malimg_dataset')
    train_dir = os.path.join(data_dir, 'train')
    test_dir = os.path.join(data_dir, 'validation')
    ...
    #############################################################
    # Load image data  from the training set
    #############################################################
    # NOTE: default value width = 1
    train_img_dir = train_dir
    img_sfx = 'png'
    ...
    N_CLASS = 25
    ...
    ##################################################
    # Predict classes for test files, and save results 
    ##################################################
    test_img_dir = test_dir
    ...
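
The reason kfold cannot be less than 2 is that k-fold cross-validation holds out one fold for validation and trains on the remaining k-1 folds, so a single fold leaves nothing to train on. The sketch below illustrates this with a hypothetical pure-Python splitter, `kfold_indices`; it is not the splitting code used by the original project.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds.

    Hypothetical helper for illustration only.
    """
    if k < 2:
        raise ValueError("k-fold cross-validation needs k >= 2")
    if k > n_samples:
        raise ValueError("k cannot exceed the number of samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


if __name__ == "__main__":
    for train, val in kfold_indices(6, 3):
        print("train:", train, "val:", val)
```

Because this is a generator, the `ValueError` for `k < 2` is raised when iteration starts, for example in `list(kfold_indices(6, 1))`.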

Convert malimg data

  • Common variables used below:
    project_dir = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset'
    data_dir = os.path.join(project_dir, 'malimg_dataset')
    
    train_dir = os.path.join(data_dir, 'train')
    test_dir = os.path.join(data_dir, 'validation-original')
    train_labels_fn = os.path.join(data_dir, 'trainLabels.csv')
    
  1. Create CSV files to store image ID and class
    import csv
    import os

    # Sort the class folders so train and validation get the same class index
    folders = sorted(os.listdir(train_dir))
    with open('./trainLabels.csv', 'w', newline='') as csvf:
        # Create the CSV writer
        writer = csv.writer(csvf)
        writer.writerow(['Id','Class'])
        for j, f in enumerate(folders):
            fullpath = os.path.join(train_dir, f)
            files = os.listdir(fullpath)
            for i in files:
                writer.writerow([i, j+1])  # class labels are 1-based
    
    folders = sorted(os.listdir(test_dir))
    with open('./valLabels.csv', 'w', newline='') as csvf:
        # Create the CSV writer
        writer = csv.writer(csvf)
        writer.writerow(['Id','Class'])
        for j, f in enumerate(folders):
            fullpath = os.path.join(test_dir, f)
            files = os.listdir(fullpath)
            for i in files:
                writer.writerow([i, j+1])  # class labels are 1-based
  2. Copy all images from each class folder into a single folder
    import shutil

    f2 = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset/malimg_dataset/validation/'
    folders = os.listdir(test_dir)
    for j, f in enumerate(folders):
        fullpath = os.path.join(test_dir, f)
        files = os.listdir(fullpath)
        for i in files:
            files_src = os.path.join(fullpath, i)
            files_dest = os.path.join(f2, i)
            shutil.copyfile(files_src, files_dest)   # copy the file
    
  3. Resize train/val images to 10000 bytes. To match the input size the model accepts, each image must be shrunk to 10000 bytes. The original data went through the same procedure for the same purpose.
    import cv2
    import numpy as np

    test_dir = os.path.join(data_dir, 'train-unresize')  # source: unresized training files
    files = os.listdir(test_dir)
    width = 1
    max_len = int(1e4)  # target length in bytes
    f2 = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset/malimg_dataset/train/'
    for idx, fn in enumerate(files):
        fn_wp = os.path.join(test_dir, fn)
        # Read the file as a raw byte stream and treat it as an N x 1 image
        bin_stream = np.fromfile(fn_wp, dtype='uint8')
        bin_stream = bin_stream.reshape(bin_stream.shape[0], 1)
        # cv2.resize takes (width, height), so the result has shape (max_len, 1)
        img_shrink = cv2.resize(bin_stream, (width, max_len))
        file_dest = os.path.join(f2, fn)
        img_shrink.tofile(file_dest)
    
    test_dir = os.path.join(data_dir, 'validation-unresize')
    files = os.listdir(test_dir)
    width = 1
    max_len = int(1e4)
    f2 = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset/malimg_dataset/validation/'
    for idx, fn in enumerate(files):
        fn_wp = os.path.join(test_dir, fn)
        bin_stream = np.fromfile(fn_wp, dtype='uint8')
        bin_stream = bin_stream.reshape(bin_stream.shape[0], 1)
        img_shrink = cv2.resize(bin_stream, (width, max_len))
        file_dest = os.path.join(f2, fn)
        img_shrink.tofile(file_dest)
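
After the three conversion steps above, a quick sanity check can catch files that were not resized or that have no label. The sketch below is an assumption-laden illustration, not part of the original project: the helper `check_converted_dir` is hypothetical, and it assumes the CSV stores the full file name in the `Id` column, as written in step 1.

```python
import csv
import os


def check_converted_dir(img_dir, labels_csv, expected_len=10000):
    """Return a list of human-readable problems found in a converted folder.

    Hypothetical helper: checks that every file in img_dir is exactly
    expected_len bytes and has a row in labels_csv (columns Id, Class).
    """
    with open(labels_csv, newline="") as csvf:
        labelled_ids = {row["Id"] for row in csv.DictReader(csvf)}

    problems = []
    for fn in sorted(os.listdir(img_dir)):
        size = os.path.getsize(os.path.join(img_dir, fn))
        if size != expected_len:
            problems.append(f"{fn}: {size} bytes, expected {expected_len}")
        if fn not in labelled_ids:
            problems.append(f"{fn}: no row in {os.path.basename(labels_csv)}")
    return problems
```

For example, `check_converted_dir(train_dir, './trainLabels.csv')` should return an empty list once the train folder has been resized and labelled.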
    

Run directly

If you want to plot a confusion matrix, comment out some code at the end and add the code below.

import os

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

labels_name = ["Adialer.C", "Agent.FYI", "Allaple.A", "Allaple.L", "Alueron.gen!J", "Autorun.K", "C2LOP.gen!g", "C2LOP.P", "Dialplatform.B", "Dontovo.A", "Fakerean", "Instantaccess", "Lolyda.AA1", "Lolyda.AA2", "Lolyda.AA3", "Lolyda.AT", "Malex.gen!J", "Obfuscator.AD", "Rbot!gen", "Skintrim.N", "Swizzor.gen!E", "Swizzor.gen!I", "VB.AT", "Wintrim.BX", "Yuner.A"]
# y_true and y_pred are assumed to be defined by the preceding prediction code; class labels are 1..25
mat_con = confusion_matrix(y_true, y_pred, labels=list(range(1, 26)))

# Setting the attributes
fig, px = plt.subplots(figsize=(8, 8))
px.matshow(mat_con, cmap=plt.cm.jet, alpha=0.5)
# matshow draws mat_con[row, col] at x=col, y=row
for m in range(mat_con.shape[0]):
    for n in range(mat_con.shape[1]):
        px.text(x=n, y=m, s=mat_con[m, n], va='center', ha='center', size='large')

# Sets the labels
num_class = np.array(range(len(labels_name)))
plt.xticks(num_class, labels_name, rotation=90, fontsize=10)
plt.yticks(num_class, labels_name, fontsize=10)
# plt.xlabel('Predictions', fontsize=16)
# plt.ylabel('Actuals', fontsize=16)
plt.title('Confusion Matrix', fontsize=15)
os.makedirs('./Confusion_matrix/', exist_ok=True)  # savefig does not create missing folders
plt.savefig(os.path.join('./Confusion_matrix/', "output.png"), format='png')
plt.show()
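
To read the matrix numerically rather than only visually, per-class recall can be derived from it. This is a minimal sketch with a hypothetical helper, `per_class_recall`; it assumes rows are true classes and columns are predictions, which is the layout `sklearn.metrics.confusion_matrix` returns.

```python
import numpy as np


def per_class_recall(conf_mat):
    """Fraction of each true class that was predicted correctly.

    Assumes rows are true classes and columns are predictions.
    """
    conf_mat = np.asarray(conf_mat, dtype=float)
    row_totals = conf_mat.sum(axis=1)
    # Classes with no true samples get NaN instead of a division error.
    return np.divide(
        np.diag(conf_mat),
        row_totals,
        out=np.full(len(row_totals), np.nan),
        where=row_totals > 0,
    )


if __name__ == "__main__":
    toy = [[8, 2, 0],
           [1, 9, 0],
           [0, 0, 0]]
    print(per_class_recall(toy))
```

With the matrix above, `per_class_recall(mat_con)` would give one value per entry of `labels_name`, in the same order.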