Malware-and-Ensemble-Learning-Research

Published in the Springer Textbook "Malware Analysis Using Artificial Intelligence and Deep Learning".

Primary Language: Python

Malware-Research

Malware research with machine learning, conducted under the guidance of Professor Mark Stamp at SJSU. Results are published in the Springer textbook "Malware Analysis Using Artificial Intelligence and Deep Learning" (https://link.springer.com/book/10.1007/978-3-030-62582-5) and in the arXiv paper "On Ensemble Learning" (https://arxiv.org/abs/2103.12521).

Dataset: https://drive.google.com/drive/u/1/folders/1ltGZw3Rw0Z-w7MXE1ltArPmWsvfFJbjX

Processed Dataset: https://drive.google.com/drive/u/3/folders/1iWYumJtqTLFo2T9V0wLOvKgoBgmh64sn

Goal: Use ensemble learning and a range of models to classify malware samples into their respective families.

Process:

  • Extract the names of all files to classify and group them by family

  • Use Radare2 to disassemble each file and write its opcode sequence to a text file

  • Create a single large .csv file containing all the opcode data

    • In the .csv file, we use the first 1000 opcodes of each sample as features for training
    • Remove any malware samples that have fewer than 1000 opcodes or are corrupted
  • models:

    • classic:

      • random forest
      • adaboost
      • xgboost
      • svm
      • bagged svm
      • hmm
      • bagged hmm
      • boosted hmm
      • knn
      • mlp
      • voting
    • deep learning:

      • cnn
      • bagged cnn
      • boosted cnn
      • lstm
      • bagged lstm
      • boosted lstm
    • voting:
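
The feature-preparation and voting steps above can be illustrated with a minimal sketch. It is not the repository's actual code. It assumes each sample is available as a (family, opcode list) pair, encodes opcodes as integer ids in order of first appearance, and combines model outputs by simple hard majority voting. The function names `build_features` and `majority_vote`, the integer encoding, and the toy family names are all hypothetical choices made for illustration, and the voting scheme used in the published work may differ.

```python
from collections import Counter

NUM_OPCODES = 1000  # the README uses the first 1000 opcodes of each sample


def build_features(samples, num_opcodes=NUM_OPCODES):
    """Truncate each sample to `num_opcodes` opcodes and drop shorter samples.

    `samples` is a list of (family, opcode_list) pairs (assumed layout).
    Returns (rows, labels, vocab); each row is a list of integer opcode ids.
    """
    vocab = {}  # opcode mnemonic -> integer id, assigned on first appearance
    rows, labels = [], []
    for family, opcodes in samples:
        if len(opcodes) < num_opcodes:
            continue  # too few opcodes: remove the sample
        truncated = opcodes[:num_opcodes]
        rows.append([vocab.setdefault(op, len(vocab)) for op in truncated])
        labels.append(family)
    return rows, labels, vocab


def majority_vote(predictions):
    """Hard voting over several models.

    `predictions` holds one list of predicted labels per model, all of the
    same length. Ties go to the label that appears first among the models.
    """
    voted = []
    for per_sample in zip(*predictions):
        voted.append(Counter(per_sample).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    # Toy data with a cutoff of 3 opcodes instead of 1000.
    toy = [
        ("family_a", ["mov", "push", "call", "ret"]),
        ("family_b", ["push", "mov"]),  # dropped: fewer than 3 opcodes
        ("family_b", ["xor", "mov", "jmp"]),
    ]
    rows, labels, vocab = build_features(toy, num_opcodes=3)
    print(rows, labels)

    # Predictions from three hypothetical models for the two kept samples.
    model_outputs = [
        ["family_a", "family_b"],
        ["family_a", "family_a"],
        ["family_b", "family_b"],
    ]
    print(majority_vote(model_outputs))
```

In practice, the rows and labels would be written to the .csv file and the per-model predictions would come from the trained classifiers listed above. The sketch only shows how the 1000-opcode cutoff and a voting step might fit together.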