This repo hosts the source code and dataset of APIGraph. For more details about our CCS 2020 paper, please see APIGraph-website.
The source code is located in the `src` directory, which includes:
- getAllEntities.py - The script to get all entities from API documents.
- getAllRelations.py - The script to extract relations between entities according to pre-defined templates.
- TransE.py - The script to convert each API in the relation graph into an embedding representation.
- clusterEmbedding.py - The script to cluster API embeddings into semantically similar groups using k-means.
- res - This directory stores the resources used by the scripts above, including API documents (already parsed into JSON format), permission relations from PScout, and some intermediate files.
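The clustering step can be illustrated with a minimal, standard-library-only sketch of k-means (Lloyd's algorithm). This is not the code in `clusterEmbedding.py`: the function names are hypothetical, the two-dimensional vectors are made-up values standing in for TransE embeddings, and the API names are shortened for readability.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(embeddings, k, iterations=20, seed=0):
    """Cluster a {api_name: vector} dict into k groups; returns {api_name: cluster_id}."""
    names = sorted(embeddings)
    rng = random.Random(seed)
    # Initialise the centroids with k distinct embeddings chosen at random.
    centroids = [list(embeddings[name]) for name in rng.sample(names, k)]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach every API to its nearest centroid.
        new_assignment = {
            name: min(
                range(k),
                key=lambda c: squared_distance(embeddings[name], centroids[c]),
            )
            for name in names
        }
        if new_assignment == assignment:
            break  # Converged: no API changed cluster.
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [embeddings[name] for name in names if assignment[name] == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


# Toy embeddings (made-up values, NOT produced by TransE.py).
toy_embeddings = {
    "TelephonyManager.getDeviceId": [0.9, 0.1],
    "TelephonyManager.getSubscriberId": [0.8, 0.2],
    "Socket.connect": [0.1, 0.9],
    "URL.openConnection": [0.2, 0.8],
}

clusters = kmeans(toy_embeddings, k=2)
for api_name, cluster_id in sorted(clusters.items()):
    print(cluster_id, api_name)
```

With these toy values, the two telephony APIs end up in one cluster and the two networking APIs in the other. The cluster ids themselves are arbitrary labels.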
The dataset is located in the `Dataset` directory. It contains 322,594 Android apps, including 32,089 malicious and 290,505 benign samples, spanning seven years (2012 to 2018). The benign samples all originate from Google Play and were downloaded from AndroZoo.
- You can use the csv files from `androzoo_files/` and `az` to download from AndroZoo.
  - e.g., `az -i androzoo_files/2012_benign.csv -o /some/output/path/`
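To fetch the benign samples for every year, the `az` invocation above can be generated in a loop. The sketch below only builds and prints the commands. It assumes the csv files follow the `<year>_benign.csv` naming seen in the 2012 example, and the per-year output sub-directories are a hypothetical layout rather than something this repo prescribes.

```python
import shlex

YEARS = range(2012, 2019)  # 2012 through 2018, the span covered by the dataset


def build_az_commands(csv_dir="androzoo_files", output_root="/some/output/path"):
    """Return one `az` argument list per year (nothing is executed here)."""
    commands = []
    for year in YEARS:
        csv_path = f"{csv_dir}/{year}_benign.csv"  # assumed naming pattern
        out_dir = f"{output_root}/{year}_benign/"  # hypothetical output layout
        commands.append(["az", "-i", csv_path, "-o", out_dir])
    return commands


commands = build_az_commands()
for command in commands:
    # Print a shell-safe command line that can be reviewed before running it.
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or each argument list can be passed to `subprocess.run` once `az` is installed and configured.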
The malware samples were downloaded from three sources: VirusShare, VirusTotal Academic Samples, and the AMD dataset. The hashes are organized by year and by label (benign or malicious) in txt format.
Note: For security and copyright reasons, we can only release the MD5 hashes of these samples. Interested users should download the samples themselves from the four sources above.
We tested four state-of-the-art Android malware classifiers as baselines, listed below.
Classifiers | Publication | API feature format | Algorithms | Reproduction |
---|---|---|---|---|
MamaDroid | NDSS 2017 | Markov Chain of API Calls | Random Forest | source code |
DroidEvolver | Euro S&P 2019 | API Occurrence | Model Pool | source code |
Drebin | NDSS 2014 | Selected API Occurrence | SVM | re-implemented |
Drebin-DL | ESORICS 2017 | Selected API Occurrence | DNN | re-implemented |
These four classifiers were published in top venues, and their source code is either publicly available or could be re-implemented by us, in some cases with the help of the original authors.
In particular, we thank the authors of DroidEvolver for their help.
We strictly follow their configurations to make sure our reproductions achieve the results reported in their papers.