The goal of this project is to perform malware analysis on datasets produced by scanning files with Python, using four basic machine learning algorithms: Gaussian Naive Bayes, Random Forest Classifier, Decision Tree Classifier, and Linear SVC. We use two datasets: one generated by scanning exe/dll files, and one sample dataset obtained from the Internet.
PE (Portable Executable) refers to a specific file format: executable files (.exe), dynamic link libraries (.dll), and driver files (.sys) are all PE files. The pefile module can parse, read, or modify PE files.
- The structure of a PE file when it is stored on disk is different from the structure after it is loaded into memory.
- When the PE file is loaded into the memory through the Windows loader, the version in the memory is called a module (Module).
- The starting address of the mapping file is called the module handle (hModule), also known as the base address (ImageBase).
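To make the on-disk layout concrete, here is a minimal sketch that reads the ImageBase field straight from the PE headers using only the standard library. The helper names `read_image_base` and `make_minimal_pe32` are hypothetical and not part of this project; the synthetic buffer only fills in the fields the reader checks and is not a loadable executable. The value stored on disk is the preferred base address, and the loader may map the module elsewhere. With pefile, the same field is available as `pe.OPTIONAL_HEADER.ImageBase`.

```python
import struct


def read_image_base(data: bytes) -> int:
    """Return the ImageBase recorded in a PE file's optional header."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    opt = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:  # PE32: 4-byte ImageBase at offset 28
        return struct.unpack_from("<I", data, opt + 28)[0]
    if magic == 0x20B:  # PE32+: 8-byte ImageBase at offset 24
        return struct.unpack_from("<Q", data, opt + 24)[0]
    raise ValueError(f"unknown optional header magic: {magic:#x}")


def make_minimal_pe32(image_base: int) -> bytes:
    """Build a synthetic PE32 header stub for demonstration only."""
    pe_offset = 0x40
    opt = pe_offset + 4 + 20
    buf = bytearray(opt + 32)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, pe_offset)
    buf[pe_offset:pe_offset + 4] = b"PE\0\0"
    struct.pack_into("<H", buf, opt, 0x10B)
    struct.pack_into("<I", buf, opt + 28, image_base)
    return bytes(buf)


print(hex(read_image_base(make_minimal_pe32(0x400000))))  # 0x400000
```

For a real file, pass the bytes read from disk, for example `read_image_base(open(path, "rb").read())`.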
pip install pefile
import pefile
import os, string, shutil, re
import sys
import csv
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
python scan_file <insert exe file>
Read <Input>.csv: manually supply the CSV file produced by scan_file.
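As a rough illustration of this step, the sketch below loads a scanner-style CSV into a feature list and a label list using the standard `csv` module. The column names (`Name`, `SizeOfCode`, `NumberOfSections`, `legitimate`), the sample rows, and the `load_dataset` helper are all hypothetical, since the actual columns written by the scanner are not listed here.

```python
import csv
import io

# Hypothetical scanner output; the real column names and values may differ.
SAMPLE_CSV = """Name,SizeOfCode,NumberOfSections,legitimate
a.exe,4096,3,1
b.exe,512,7,0
"""


def load_dataset(handle, label_column="legitimate", skip_columns=("Name",)):
    """Split CSV rows into numeric feature vectors and integer labels."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(int(row[label_column]))
        features.append([
            float(value)
            for key, value in row.items()
            if key != label_column and key not in skip_columns
        ])
    return features, labels


X, y = load_dataset(io.StringIO(SAMPLE_CSV))
print(X)  # [[4096.0, 3.0], [512.0, 7.0]]
print(y)  # [1, 0]
```

To load a file on disk instead of the in-memory sample, open it with `open(path, newline="")` and pass the handle to `load_dataset`.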
The scanner should produce a CSV dataset from the scanned exe file. That output CSV is then the input to the machine learning analysis, which uses the algorithms listed below.
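The analysis step might look like the following sketch, which trains the four classifiers on the same train/test split and reports accuracy for each. It makes several assumptions: the data is synthetic (generated with `make_classification`) and stands in for the scanner's CSV, the split ratio and hyperparameters are arbitrary, and `LinearSVC` is used for the Linear SVC model. It is not the project's actual training script and will not reproduce the figures reported for the Internet dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and labels (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Linear SVC": LinearSVC(dual=False, max_iter=10000),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} accuracy (in %): {100 * accuracies[name]:.2f}")
```

Fitting the scaler on the training split alone keeps test-set statistics out of training, which keeps the reported accuracy an honest estimate.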
The following results are from the internet example dataset:
- Gaussian Naive Bayes accuracy (in %): 32.24573030843742
- Random Forest accuracy (in %): 98.44506755034412
- Decision Tree Classifier accuracy (in %): 71.32296711700229
- Linear SVC accuracy (in %): 96.06853296619245
Future work: continue developing the pefile scanner so that its dataset output is complete, and so that it can identify and remove malicious files among the scanned exe programs.
pratikpv/malware_detect2: Malware Classification using Machine learning https://github.com/pratikpv/malware_detect2
aayuv17/Malware-Analysis: Malware Analysis using Machine Learning https://github.com/aayuv17/Malware-Analysis
bindog/ToyMalwareClassification: Kaggle Microsoft Malware Classification https://github.com/bindog/ToyMalwareClassification
erocarrera/pefile: pefile is a Python module to read and work with PE (Portable Executable) files https://github.com/erocarrera/pefile
Malicious Code Analysis Practical Series Articles https://github.com/Vxer-Lee/MalwareAnalysis/tree/master/3.%20%E5%8A%A8%E6%80%81%E5%88%86%E6%9E%90%E5%9F%BA%E7%A1%80%E6%8A%80%E6%9C%AF
Machine learning for encrypted malicious traffic detection: Approaches, datasets and comparative study - ScienceDirect