On Training Robust PDF Malware Classifiers (USENIX Security '20) https://arxiv.org/abs/1904.03542

On Training Robust PDF Malware Classifiers

Code for our paper On Training Robust PDF Malware Classifiers (USENIX Security '20), by Yizheng Chen, Shiqi Wang, Dongdong She, and Suman Jana.

Blog Posts

Monotonic Malware Classifiers (5 min read)

Gmail's malicious document classifier can still be trivially evaded (3 min read)

Dataset

Full PDF dataset

Available here at Contagio: "16,800 clean and 11,960 malicious files for signature testing and research."

Training and Testing datasets

We split the PDFs into 70% training and 30% testing sets. We then used the Hidost feature extractor, with its default compact path option, to extract structural path features, obtaining the following training and testing data.

            Training PDFs   Testing PDFs
Malicious   6,896           3,448
Benign      6,294           2,698

The Hidost structural paths are here.

The extracted training and testing libsvm files are here. The 500 seed malware samples with network activities from EvadeML are in the test set.
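As a sketch, libsvm files like the ones above can be loaded with scikit-learn. The helper and file names below are illustrative assumptions, not part of the released code; the key detail is pinning the test matrix to the training dimensionality so the feature columns line up.

```python
# Illustrative sketch: loading extracted libsvm feature files with
# scikit-learn. File paths here are placeholders, not the released files.
from sklearn.datasets import load_svmlight_file

def load_split(train_path, test_path):
    """Load train/test libsvm files on a shared feature space."""
    X_train, y_train = load_svmlight_file(train_path)
    # Reuse the training dimensionality so the test matrix lines up.
    X_test, y_test = load_svmlight_file(test_path, n_features=X_train.shape[1])
    return X_train, y_train, X_test, y_test
```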

500 seed malware hash list. Put these PDFs under data/500_seed_pdfs/.

Models

The following models are TensorFlow checkpoints, except that the two ensemble models need additional wrappers.

Checkpoint                          Model
baseline_checkpoint                 Baseline
baseline_adv_delete_one             Adv Retrain A
baseline_adv_insert_one             Adv Retrain B
baseline_adv_delete_two             Adv Retrain C
baseline_adv_insert_rootallbutone   Adv Retrain D
baseline_adv_combine_two            Adv Retrain A+B
adv_del_twocls                      Ensemble A+B Base Learner
adv_keep_twocls                     Ensemble D Base Learner
robust_delete_one                   Robust A
robust_insert_one                   Robust B
robust_delete_two                   Robust C
robust_insert_allbutone             Robust D
robust_monotonic                    Robust E
robust_combine_two_v2_e18           Robust A+B
robust_combine_three_e17            Robust A+B+E

The following are XGBoost tree ensemble models.

Binary                     Model
model_10learner_test.bin   Monotonic Classifier, 10 learners
model_100learner.bin       Monotonic Classifier, 100 learners
model_1000learner.bin      Monotonic Classifier, 1000 learners
model_2000learner.bin      Monotonic Classifier, 2000 learners

Training Code

To train the baseline model, adv retrain models, ensemble models, XGBoost monotonic models, and robust models, and to evaluate their verified robust accuracies (VRAs), see the README under train/.

Attacks in the Paper

See the README under attack/.

[Figure: estimated robust accuracy under the adaptive attack]

After running our adaptive attack (based on EvadeML) against the Robust A+B+E model for three weeks, we were unable to fully evade it, i.e., to generate functional evasive PDF malware variants. As the figure above shows, the estimated robust accuracy under adaptive attacks can be driven to 0% for the Monotonic 100 and Robust A+B models, but not for the Robust A+B+E model. We hope researchers will design stronger attacks to evade our Robust A+B+E model.

Using the EvadeML framework, our adaptive strategies against this model are:

  • Move Exploit Attack. The monotonic property (Property E) forces the attacker to delete objects from the malware, but deletion could remove the exploit. Therefore, we implement a new mutation to move the exploit around to different trigger points in the PDF.
  • Scatter Attack. To evade Robust A+B and Robust A+B+E, we insert and delete more objects under different subtrees. We keep track of past insertion and deletion operations separately, and prioritize new insertion and deletion operations to target a different subtree.
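The scatter attack's bookkeeping can be sketched as follows. This is a minimal, hypothetical illustration of the idea of prioritizing less-touched subtrees, with names (ScatterScheduler, next_target) invented for this sketch; it is not taken from the EvadeML code.

```python
# Hypothetical sketch of scatter-attack bookkeeping: track past insert and
# delete operations per subtree, and steer each new mutation toward the
# subtree that operation type has touched least often.
import random
from collections import Counter

class ScatterScheduler:
    def __init__(self, subtrees):
        self.subtrees = list(subtrees)
        # Separate counters for insertions and deletions, per subtree.
        self.ops = {"insert": Counter(), "delete": Counter()}

    def next_target(self, op, rng=random):
        """Pick the least-used subtree for this op; ties broken randomly."""
        counts = self.ops[op]
        least = min(counts[s] for s in self.subtrees)
        candidates = [s for s in self.subtrees if counts[s] == least]
        choice = rng.choice(candidates)
        counts[choice] += 1
        return choice
```

For example, after an insertion under /Root/Pages, the next insertion would prefer a different subtree such as /Root/OpenAction, spreading the mutations across the PDF's structure.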

MalGAN Attack Evaluation

Please check out this MalGAN attack evaluation against our robust models by Zeyi.