
ML Malware Analysis Project

About

This was a side project I ran for a while back around 2017-2018, experimenting with building malware classifiers for PE files. It unfortunately fizzled out due to not having a large enough set of known-good files and other obligations at the time. If you have any questions about it, please feel free to shoot me a message!

Overall, the direction I was going with this project was to create a site somewhat like VirusTotal, or maybe even my own AV agent that could run on Windows.

The code is a bit sloppy at the moment; some improvements I would make in the future would be to:

  1. Improve the pipeline process

    • The pipeline is currently run via a bunch of clunky shell / Python scripts; it could definitely be improved to run in a single script / language.
    • Automate malware ingestion. The nice people at VirusShare have given us access to their torrents; it would be nice to pull down the files and incorporate them as an automated process.
  2. Use Tensorflow or a more scalable machine learning library

    • Most of this currently runs via scikit-learn, which probably isn't the best fit for this type of work.
    • This would most likely speed up training time for these classifiers as well.
  3. Experiment with weighting malware samples:

    • There are far too many known-bad samples compared to known-good, which is an issue when training a classifier: it heavily skews toward classifying things as bad. To make this functional we would need as few false positives as possible, as we don't want to block anything benign when running as an AV agent on a machine. There is a whitelist server included here, but it would be best to have a more robust classifier as well.
  4. Dockerize the services, maybe make a Kubernetes deployment?

    • Many of the services should be easy to containerize in order to run / scale better at large.
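On point 3, one common approach is to weight samples inversely to class frequency. Here is a minimal sketch of computing balanced per-class weights (this mirrors the formula scikit-learn uses for class_weight='balanced'; the labels and counts below are made up for illustration):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so the rare known-good class counts as much as the abundant known-bad one."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical skewed dataset: 8 known-bad samples, 2 known-good.
labels = ["bad"] * 8 + ["good"] * 2
weights = balanced_class_weights(labels)
# Per-sample weights that could be passed to a classifier's sample_weight argument.
sample_weights = [weights[label] for label in labels]
print(weights)  # {'bad': 0.625, 'good': 2.5}
```

With these weights, misclassifying a rare known-good file costs the learner as much as misclassifying several known-bad ones, which pushes down the false-positive rate the AV use case cares about.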

HOW TO

To Index Files:

  1. Select the files you wish to index and put them all in a folder / its subdirectories, then run python pipeline.py --output_directory <output_directory> --input_directory <input_directory> --temp_directory <temp_directory>
  2. Run FileIndexer.py with the appropriate arguments. This should yield a CSV.
  3. Run ImportsIndexer.py on the resulting CSV file. This should yield a CSV file for strings and imports.
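The indexing step above can be sketched roughly as follows. This is not FileIndexer.py itself, just an illustration of the kind of per-file CSV it produces; the column names and features here (hash, size) are stand-in assumptions:

```python
import csv
import hashlib
import os

def index_files(input_directory, output_csv):
    """Walk a directory tree and write one CSV row per file:
    path, SHA-256, and size in bytes (illustrative columns only)."""
    with open(output_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "sha256", "size_bytes"])
        for root, _dirs, files in os.walk(input_directory):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                writer.writerow([path, digest, os.path.getsize(path)])
```

A real indexer for this project would also parse PE headers (e.g. with the pefile library) to pull out imports and strings; hashes and sizes are just placeholders here.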

To Train a Classifier:

  1. Select the stage 1 classifier you wish to use (Bayes is fastest, Neural Network is most accurate)
  2. Run selected classifier with appropriate arguments (the csv and text file)
  3. Run add_data_from_imports_learner
    a. This replaces the imports data column with estimates produced by the stage 1 classifiers
  4. Run stage_2_classifier
    a. This trains a second-stage classifier with an algorithm of your choosing
    b. To choose, uncomment the classifier you want to use in the script. So far the decision trees have worked the best
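The two-stage idea in steps 3-4 can be illustrated with a toy transformation: stage 1 turns the raw imports column into a numeric malware estimate, and stage 2 then trains on purely numeric rows. The estimator below is a hypothetical stand-in (the real project uses trained scikit-learn classifiers, and the "suspicious" import names are only examples):

```python
def imports_estimate(imports):
    """Hypothetical stage 1 stand-in: fraction of imports that look suspicious.
    The real pipeline uses a trained classifier (e.g. naive Bayes) here."""
    suspicious = {"VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"}
    if not imports:
        return 0.0
    return sum(1 for name in imports if name in suspicious) / len(imports)

def add_estimates(rows):
    """Replace each row's raw imports list with the stage 1 estimate,
    producing numeric feature rows suitable for the stage 2 classifier."""
    return [{**row, "imports": imports_estimate(row["imports"])} for row in rows]

rows = [
    {"size": 4096, "imports": ["CreateFileW", "ReadFile"]},
    {"size": 8192, "imports": ["VirtualAllocEx", "WriteProcessMemory"]},
]
print(add_estimates(rows))
```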

Testing Classifiers:

  1. For an imports classifier (stage 1)
    a. Run plot_confusion_matrix_imports_classifier with the original data file (from step 2 of "To Index Files") and other imports as necessary
    b. Two charts will appear, normalized and unnormalized
  2. For a stage 2 classifier
    a. Run plot_confusion_matrix_regular_classifier with the data file from step 3 of "To Train a Classifier"
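Both plot scripts boil down to a confusion matrix. Independent of the plotting code, a minimal sketch of computing one, in both the unnormalized and normalized forms the charts show:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows = true label, columns = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def normalize_rows(matrix):
    """Divide each row by its total, so each row shows per-class recall."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([cell / total if total else 0.0 for cell in row])
    return out

y_true = ["bad", "bad", "good", "good"]
y_pred = ["bad", "good", "good", "good"]
cm = confusion_matrix(y_true, y_pred, ["bad", "good"])
print(cm)                  # [[1, 1], [0, 2]]
print(normalize_rows(cm))  # [[0.5, 0.5], [0.0, 1.0]]
```

The off-diagonal cell in the "good" column of the "bad" row is the false-negative count; the reverse cell is the false positives that matter most for an AV agent.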

Setting up Servers:

This project consists of 5 servers: a frontend, MLServer, WhitelistServer, APIServer, and a MemoryBankServer (basically a cache).

  • To run these, go to each server's .py file and load the config in PlatformConfig.txt
  • For the frontend, run python manage.py runserver to start the server. The config will be loaded
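The PlatformConfig.txt format isn't documented here; assuming a simple key = value layout (an assumption, not the repo's actual format), loading it might look like:

```python
def load_platform_config(path):
    """Parse a key = value config file, skipping blanks and # comments.
    The format is an assumption; adjust to match the real PlatformConfig.txt."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config
```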

From here you can browse to the front end and upload files to check if they are malicious!

Warnings:

  • Please note that this code has not been optimized yet; the multithreaded Python scripts are not functional yet.
  • Data_Indexed is data in its safe form after being indexed by FileIndexer.py.

Other:

  • These learners were trained on about 40-50 GB of known-good and known-bad data. If you would like to add your own indexed files to the set, please do so via the FileIndexer.py script. When doing this, please ensure that the files you are indexing are definitively known good or bad.