We are currently unable to release the data or code for this project because of legal and proprietary restrictions. We will update this page if this changes.
This is the GitHub page for the manuscript "Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features" by Joshua Saxe and Konstantin Berlin.
The free pre-print of the publication can be found at http://arxiv.org/abs/1508.03096.
### Synopsis
Malware remains a serious problem for corporations, government agencies, and individuals, as attackers continue to use it as a tool to effect frequent and costly network intrusions. Today malware detection is still done mainly with heuristic and signature-based methods that struggle to keep up with malware evolution. Machine learning holds the promise of automating the work required to detect newly discovered malware families, and could potentially learn generalizations about malware and benign software (benignware) that support the detection of entirely new, unknown malware families. Unfortunately, few proposed machine learning based malware detection methods have achieved the low false positive rates and high scalability required to deliver deployable detectors.
In this paper we introduce an approach that addresses these issues, describing in reproducible detail the deep neural network based malware detection system that Invincea has developed. Our system achieves a usable detection rate at an extremely low false positive rate and scales to real world training example volumes on commodity hardware. Specifically, we show that our system achieves a 95% detection rate at 0.1% false positive rate (FPR), based on more than 400,000 software binaries sourced directly from our customers and internal malware databases. We achieve these results by directly learning on all binaries, without any filtering, unpacking, or manually separating binary files into categories. Further, we confirm our false positive rates directly on a live stream of files coming in from Invincea’s deployed endpoint solution, provide an estimate of how many new binary files we expected to see a day on an enterprise network, and describe how that relates to the false positive rate and translates into an intuitive threat score.
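To make the "detection rate at a fixed false positive rate" metric concrete, here is a minimal sketch in plain Python. It is not the project's code, which is not released. The function names, the toy scores, and the figure of 1,000 new binaries per day are all hypothetical and chosen only for illustration. The sketch assumes each file receives a numeric score and is flagged as malware when that score exceeds a threshold.

```python
def detection_rate_at_fpr(benign_scores, malware_scores, target_fpr):
    """Pick the lowest threshold whose FPR stays at or below target_fpr.

    Hypothetical helper: a file is flagged as malware when score > threshold.
    Returns (threshold, detection_rate, achieved_fpr).
    """
    benign_sorted = sorted(benign_scores, reverse=True)
    # Largest number of benign files we may flag without exceeding the target.
    allowed_fp = int(target_fpr * len(benign_sorted))
    if allowed_fp < len(benign_sorted):
        # Only scores strictly above this value are flagged, so at most
        # allowed_fp benign files end up as false positives.
        threshold = benign_sorted[allowed_fp]
    else:
        threshold = float("-inf")
    false_positives = sum(s > threshold for s in benign_scores)
    true_positives = sum(s > threshold for s in malware_scores)
    return (
        threshold,
        true_positives / len(malware_scores),
        false_positives / len(benign_scores),
    )


def expected_false_alarms_per_day(fpr, new_binaries_per_day):
    """Expected number of benign files flagged per day at a given FPR."""
    return fpr * new_binaries_per_day


if __name__ == "__main__":
    # Toy scores, invented for this example only.
    benign = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9]
    malware = [0.3, 0.5, 0.7, 0.8, 0.95]
    threshold, tpr, fpr = detection_rate_at_fpr(benign, malware, target_fpr=0.2)
    print(f"threshold={threshold} detection_rate={tpr} fpr={fpr}")
    # Assumed volume of 1,000 new binaries per day (not a figure from the paper).
    print(expected_false_alarms_per_day(0.001, 1000))
```

The second helper shows why the false positive rate has to be read alongside file volume: multiplying the FPR by the number of new binaries seen per day gives the expected number of false alarms an analyst would face each day.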
Our results demonstrate that it is now feasible to quickly train and deploy a low resource, highly accurate machine learning classification model, with false positive rates that approach traditional labor intensive signature based methods, while also detecting previously unseen malware. Since machine learning models tend to improve with larger data-sizes, we foresee deep neural network classification models gaining in importance as part of a layered network defense strategy in coming years.
Code, documentation, and data copyright 2015 Invincea Labs, LLC. Release is governed by ? license.