Author: Zeyuan Xu (github.heraclixus.com). For a detailed description, refer to the IPython notebook.
This project uses machine learning algorithms to classify malware samples, with a focus on polymorphic and metamorphic malware. The most desirable dataset is Microsoft Big 2015 (terabytes in size), which is best tackled on a cloud computing platform. The local ML project uses open-source malware sample sets, parsed into JSON files.
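As a rough illustration of how JSON-parsed samples could become classifier input, the sketch below counts opcodes per sample and lays the counts out over a shared vocabulary. The JSON layout (a "label" field and an "opcodes" list) and the function names are assumptions made for this example, not the schema used in the notebook.

```python
import json
from collections import Counter

# Hypothetical JSON layout: one object per sample with a "label" and an "opcodes" list.
SAMPLE_JSON = """
[
  {"label": "trojan", "opcodes": ["mov", "push", "call", "mov"]},
  {"label": "worm",   "opcodes": ["xor", "mov", "jmp"]}
]
"""

def load_samples(text):
    """Parse JSON text into (opcode_counts, label) pairs."""
    records = json.loads(text)
    return [(Counter(r["opcodes"]), r["label"]) for r in records]

def to_matrix(samples):
    """Turn per-sample opcode counts into fixed-length rows over a shared vocabulary."""
    vocab = sorted({op for counts, _ in samples for op in counts})
    rows = [[counts.get(op, 0) for op in vocab] for counts, _ in samples]
    labels = [label for _, label in samples]
    return vocab, rows, labels

vocab, X, y = to_matrix(load_samples(SAMPLE_JSON))
```

The rows in X and the labels in y are in the shape most classifiers expect, so they could be passed on to a Random Forest or any other model.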
More samples of different malware types are required. In addition, more advanced feature selection techniques can be used, and other classification algorithms can be tested against the benchmark Random Forest classifier. If the image representation of malware samples is used, a CNN can also be tested (with TensorFlow).
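For the image representation, one common approach is to read each byte of a sample as a grayscale pixel and wrap the byte stream into rows of fixed width. The sketch below shows that idea in plain Python; the function name and the default width are illustrative assumptions, not part of this project.

```python
def bytes_to_image(data, width=16):
    """Reshape raw bytes into rows of `width` grayscale pixels (0-255).

    The last row is zero-padded so every row has the same length.
    """
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

# Example: five bytes wrapped into rows of four pixels.
image = bytes_to_image(b"MZ\x90\x00\x03", width=4)
```

The resulting 2D grid could then be resized to a fixed shape and fed to a CNN.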