Detect Malware using N-gram Frequency with ML

Primary Language: Python

CSCE451-C3

You can download an example dataset for the MalwareAnalysis project from the following link:

Download Example Dataset

This dataset can be used to test and train the model.

Install Python

For macOS:

  1. Download Python from python.org/downloads.
  2. Open the .pkg file and follow instructions.
  3. Verify in Terminal: python3 --version.

For Linux:

  1. Update packages: sudo apt-get update.
  2. Install Python: sudo apt-get install python3.
  3. Verify in Terminal: python3 --version.

Create your own .env file

Set VIRUSTOTAL_API_KEY to your VirusTotal API key.
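
As a rough illustration, a .env file is a plain-text file of KEY=value lines, for example a single line reading VIRUSTOTAL_API_KEY=your-key-here. The sketch below shows one way such a file could be read using only the standard library. The helper names load_env_file and get_virustotal_key are hypothetical, and the project itself may load the key differently (for instance through a dedicated package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=value lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_virustotal_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get("VIRUSTOTAL_API_KEY") or load_env_file(path).get(
        "VIRUSTOTAL_API_KEY"
    )
```

Keeping the key in .env rather than in source code means it stays out of version control, provided .env is listed in .gitignore.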

Training the Model on Your Own Dataset

  1. Create Folder Structure: From the root directory, create a folder named MalwareAnalysis. Inside it, create two subfolders:

    • Malware for storing malware opcodes.
    • Benign for storing benign opcodes.
  2. Create and Activate Virtual Environment:

    • Create a virtual environment: python -m venv venv (use python3 if the python command is not available)
    • Activate the virtual environment:
      • Windows: .\venv\Scripts\activate
      • macOS/Linux: source venv/bin/activate
  3. Install Python Packages:

    • Run: pip install -r requirements.txt
  4. Train the Model:

    • Run: python train.py
    • The vectorizer used to transform user input is saved to count_vectorizer.joblib.
    • The trained model is saved to rf_opcodes_freq_ngram_2.joblib.
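
The folder layout and the n-gram frequency idea behind the steps above can be sketched as follows. This is a minimal illustration, not the contents of train.py. It assumes each file under Malware and Benign holds whitespace-separated opcode mnemonics (the file format is not specified here) and counts 2-grams, as the model filename rf_opcodes_freq_ngram_2.joblib suggests. All function names and the 1/0 label encoding are hypothetical.

```python
from collections import Counter
from pathlib import Path


def create_dataset_folders(root="."):
    """Create MalwareAnalysis/Malware and MalwareAnalysis/Benign under root."""
    base = Path(root) / "MalwareAnalysis"
    for name in ("Malware", "Benign"):
        (base / name).mkdir(parents=True, exist_ok=True)
    return base


def opcode_ngram_counts(opcodes, n=2):
    """Count how often each run of n consecutive opcodes occurs."""
    # zip over n shifted copies of the list yields every window of length n.
    grams = zip(*(opcodes[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def load_labeled_counts(base, n=2):
    """Return (counts, label) pairs; label 1 = Malware, 0 = Benign (assumed encoding)."""
    samples = []
    for folder, label in (("Malware", 1), ("Benign", 0)):
        for path in sorted((Path(base) / folder).iterdir()):
            if path.is_file():
                # Assumption: one opcode sequence per file, whitespace-separated.
                opcodes = path.read_text(encoding="utf-8", errors="ignore").split()
                samples.append((opcode_ngram_counts(opcodes, n), label))
    return samples


# Example: the 2-gram "push mov" appears twice in this short sequence.
counts = opcode_ngram_counts("push mov push mov ret".split())
print(counts["push mov"])  # prints 2
```

Frequency vectors like these are the kind of feature a count vectorizer produces and a classifier consumes; the actual vectorizer and model settings are defined by train.py.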

Using the Trained Model

  1. Set Up Environment (if not already done):

    • Create and activate the virtual environment.
    • Install Python packages: pip install -r requirements.txt
  2. Run the Model:

    • Execute the script with an executable filename as an argument: python main.py <exe filename>
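
To check several executables in one go, the command above could be wrapped in a small script. This is a minimal sketch, assuming main.py is in the current working directory and takes exactly one filename argument as shown. build_scan_command and scan_all are hypothetical helpers, not part of the project, and the meaning of main.py's exit code is not documented here.

```python
import subprocess
import sys
from pathlib import Path


def build_scan_command(exe_path, script="main.py"):
    """Build the argument list for: python main.py <exe filename>."""
    # sys.executable keeps the call inside the active virtual environment.
    return [sys.executable, script, str(Path(exe_path))]


def scan_all(exe_paths, script="main.py"):
    """Run the script once per file and return {path: exit code}."""
    results = {}
    for exe_path in exe_paths:
        completed = subprocess.run(build_scan_command(exe_path, script), check=False)
        results[str(exe_path)] = completed.returncode
    return results
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with filenames that contain spaces.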