The objective of this repository is to detect malware in Android apps.
TL;DR: Bad guys abuse permissions and outdated software
The data can be found on Kaggle: https://www.kaggle.com/saurabhshahane/android-malware-dataset
For more background, refer to the research paper.
The entire repository is written in Python 3.x, and you will need some standard data science libraries such as pandas and scikit-learn.
The notebooks are ordered in the sequence I worked through the project. I have added notes in the notebooks to further explain my reasoning.
I was able to obtain an AUC of 0.94 with logistic regression. The plot below illustrates the ROC and precision-recall (PRC) curves as I varied the size of the vocabulary.
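As a rough illustration of the vocabulary-size experiment, here is a minimal sketch using scikit-learn. It is not the code from the notebooks: the toy corpus, the term lists, the vocabulary sizes, and the train/test split are all assumptions made up for this example, so the scores it prints say nothing about the real dataset.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in corpus (NOT the Kaggle data). Term lists are invented.
rng = random.Random(0)
malware_terms = ["send_sms", "read_contacts", "boot_completed", "install_packages"]
benign_terms = ["camera", "vibrate", "bluetooth", "nfc"]
shared_terms = ["internet", "wake_lock"]

texts, labels = [], []
for label, terms in ((1, malware_terms), (0, benign_terms)):
    for _ in range(20):
        words = rng.sample(terms, 3) + rng.sample(shared_terms, 1)
        texts.append(" ".join(words))
        labels.append(label)  # 1 = malware, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0
)

results = {}
for vocab_size in (2, 4, 8):
    # max_features keeps only the most frequent terms in the training text.
    vectorizer = CountVectorizer(max_features=vocab_size)
    train_matrix = vectorizer.fit_transform(X_train)
    test_matrix = vectorizer.transform(X_test)

    model = LogisticRegression(max_iter=1000)
    model.fit(train_matrix, y_train)
    scores = model.predict_proba(test_matrix)[:, 1]

    # ROC AUC summarises the ROC curve; average precision summarises the PRC.
    results[vocab_size] = (
        roc_auc_score(y_test, scores),
        average_precision_score(y_test, scores),
    )

for vocab_size, (auc, ap) in results.items():
    print(f"vocab={vocab_size}: ROC AUC={auc:.2f}, average precision={ap:.2f}")
```

To draw the full curves instead of the summary numbers, `sklearn.metrics.roc_curve` and `sklearn.metrics.precision_recall_curve` return the points to plot.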
Spoiler alert: the weights that mattered most in the model belonged to the numerical features from the Android Manifest, not the text data.
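One simple way to see which features carry the most weight is to sort them by the absolute value of their coefficient. The helper below is a hypothetical sketch, and the feature names and weights are placeholders invented for illustration, not values from this project. With scikit-learn, the names would typically come from `vectorizer.get_feature_names_out()` and the weights from `model.coef_[0]`.

```python
def rank_by_weight(names, weights, top=3):
    """Return the `top` (name, weight) pairs with the largest absolute weight."""
    if len(names) != len(weights):
        raise ValueError("names and weights must have the same length")
    pairs = sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)
    return pairs[:top]


# Placeholder names and weights, made up for this example only.
names = ["num_permissions", "target_sdk", "word:game", "word:free"]
weights = [1.8, -1.2, 0.3, -0.1]

top_features = rank_by_weight(names, weights, top=2)
for name, weight in top_features:
    print(f"{name}: {weight:+.2f}")
```

Comparing raw weights like this is only meaningful when the features are on comparable scales, so numerical features are usually standardised before fitting.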