Primary language: Python. License: MIT.

model_extraction_malware

DOI: 10.1016/j.cose.2023.103192

Repository for the paper "Stealing and Evading Malware Classifiers and Antivirus at Low False Positive Conditions".

Usage

To generate a surrogate model, specify the target model, the surrogate type, the sampling method, and the dataset. The help output below lists the allowed values for each parameter.


python model_extraction.py -h
usage: Model extraction using active learning techniques [-h] -d DATA_DIR [-s SEED] [-m METHOD] [-n NUM_QUERIES] [-b BUDGET]
                                                         [-e NUM_EPOCHS] [-t {DNN,dualDNN,LGB,SVM}] [-l LOG_DIR]
                                                         [-tg {ember,sorel-FCNN,sorel-LGB,AV1,AV2,AV3,AV4}]
                                                         [-f {top10families,Adload,WannaCry,Pykse,Azorult,Bancteian,Emotet,Swisyn,Vobfus}]
                                                         [--dataset {ember,sorel,AV}] [--fpr FPR]

optional arguments:
  -h, --help            show this help message and exit
  -d DATA_DIR, --data_dir DATA_DIR
                        Directory that holds the data
  -s SEED, --seed SEED  Seed for random states
  -m METHOD, --method METHOD
                        entropy, random, medoids, mc_dropout, k-center, ensemble
  -n NUM_QUERIES, --num_queries NUM_QUERIES
                        Number of query rounds
  -b BUDGET, --budget BUDGET
                        Total query budget
  -e NUM_EPOCHS, --num_epochs NUM_EPOCHS
                        Number of training epochs per round
  -t {DNN,dualDNN,LGB,SVM}, --type {DNN,dualDNN,LGB,SVM}
                        Type of surrogate model
  -l LOG_DIR, --log_dir LOG_DIR
                        Where to store the log files with the results
  -tg {ember,sorel-FCNN,sorel-LGB,AV1,AV2,AV3,AV4}, --target_model {ember,sorel-FCNN,sorel-LGB,AV1,AV2,AV3,AV4}
                        Target model
  -f {top10families,Adload,WannaCry,Pykse,Azorult,Bancteian,Emotet,Swisyn,Vobfus}, --family {top10families,Adload,WannaCry,Pykse,Azorult,Bancteian,Emotet,Swisyn,Vobfus}
                        Select top10 families or one specific malware family
  --dataset {ember,sorel,AV}
                        Thief and test dataset
  --fpr FPR             FPR level for surrogate metrics.
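
The --fpr option sets the false positive rate at which the surrogate's metrics are reported. To illustrate what evaluating at a fixed FPR involves, here is a minimal sketch that is not taken from this repository. It picks a score threshold from benign-sample scores so that the FPR stays at or below the requested level, then measures the detection rate on malware scores at that threshold. The function names and the convention that a sample is flagged when its score is strictly above the threshold are assumptions made for the example.

```python
import math


def threshold_at_fpr(benign_scores, fpr):
    """Return a threshold whose FPR on benign_scores does not exceed fpr.

    Assumption: a sample is flagged as malicious when score > threshold.
    """
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must be in [0, 1]")
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign samples we may flag without exceeding the FPR.
    allowed = math.floor(fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every benign sample may be flagged
    # Only scores strictly above this value are flagged, so at most
    # `allowed` benign samples end up as false positives.
    return ranked[allowed]


def detection_rate(malware_scores, threshold):
    """Fraction of malware samples flagged at the given threshold."""
    if not malware_scores:
        raise ValueError("need at least one malware score")
    flagged = sum(1 for score in malware_scores if score > threshold)
    return flagged / len(malware_scores)
```

At very low FPR levels the threshold is set by only a handful of benign samples, so the benign set has to be large for the estimate to be stable.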

Example

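The following command trains a LightGBM surrogate of the sorel-FCNN target using medoids sampling, with a total budget of 2500 queries over 10 query rounds. The surrogate is stored in the output folder (/tmp/logs/), along with a log file containing the results of each iteration.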

python model_extraction.py --data_dir /data/mari/sorel-data --dataset sorel --seed 42 --method medoids --type LGB --target_model sorel-FCNN --num_epochs 1 --num_queries 10 --log_dir "/tmp/logs/" --budget 2500 --fpr 0.006
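
To compare sampling methods or seeds, the command above can be repeated with different values. The following is a minimal sketch of a wrapper that does this; the wrapper is not part of the repository. It uses only the flags documented in the help output, and the helper names build_command and run_sweep and the dry_run switch are hypothetical. It assumes the script is run from the repository root.

```python
import subprocess
import sys


def build_command(method, seed,
                  data_dir="/data/mari/sorel-data", log_dir="/tmp/logs/"):
    """Build the argument list for one model_extraction.py run."""
    return [
        sys.executable, "model_extraction.py",
        "--data_dir", data_dir,
        "--dataset", "sorel",
        "--seed", str(seed),
        "--method", method,
        "--type", "LGB",
        "--target_model", "sorel-FCNN",
        "--num_epochs", "1",
        "--num_queries", "10",
        "--log_dir", log_dir,
        "--budget", "2500",
        "--fpr", "0.006",
    ]


def run_sweep(methods, seeds, dry_run=True):
    """Run (or, with dry_run=True, only print) one command per method and seed."""
    commands = [build_command(m, s) for m in methods for s in seeds]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sweep if a run exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

For example, run_sweep(["medoids", "entropy"], [42, 43]) prints four commands without executing them. This README does not say whether runs that share a log directory overwrite each other's files, so passing a separate log_dir per run may be safer.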

If you use this code, please cite:

@article{RIGAKI2023103192,
  title = {Stealing and evading malware classifiers and antivirus at low false positive conditions},
  journal = {Computers \& Security},
  volume = {129},
  pages = {103192},
  year = {2023},
  issn = {0167-4048},
  doi = {10.1016/j.cose.2023.103192},
  url = {https://www.sciencedirect.com/science/article/pii/S0167404823001025},
  author = {M. Rigaki and S. Garcia},
}