This is the official code for our Findings of EMNLP 2022 paper, "Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks".
Python: 3.8.0
To install the dependencies, run
pip install -r requirements.txt
For the datasets used in our paper, please refer to the code of Embedding Poisoning.
For the poisoned models, please obtain the poisoned weights by following the instructions in the code of the attack methods developed by previous researchers:
- RIPPLe (ACL 2020)
- Embedding Poisoning (also including data-free embedding poisoning and BadNet) (NAACL 2021)
- Layerwise Weight Poisoning (EMNLP 2021)
- NeuBA (ICML 2021 Workshop on Adversarial Machine Learning)
- BadPre (ICLR 2022)
For instance, suppose we have a BERT model for SST-2 classification poisoned by the embedding poisoning attack, with the rare-word trigger mb and the target class 1.
First, run the following command to extract the embeddings:
python extract_embeddings.py --model_path ../Embedding-Poisoning/saved_models/sst-2/badnet_rw_mb_ls --test_data_path ./sentiment_data/sst-2/test.tsv --constructing_data_path ./sentiment_data/sst-2/dev.tsv --output_dir ./log/embeddings/dan/sst-2/badnet_rw_mb_ls --batch_size 128 --backdoor_triggers mb --protect_label 1 --backdoor_trigger_type sentence
Notes:
If you want to insert multiple trigger words, such as mb and bb, join them with a comma: --backdoor_triggers mb,bb. If you want to experiment on a poisoned model with a sentence trigger, use --backdoor_trigger_type sentence and pass the trigger sentence string to --backdoor_triggers.
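The note above can be read as the following parsing rule. This is only a minimal sketch of that reading, not the code in extract_embeddings.py: the function name parse_triggers, the example sentence, and the value "word" for the non-sentence case are all hypothetical.

```python
def parse_triggers(raw, trigger_type):
    """Turn the --backdoor_triggers string into a list of triggers.

    Assumption: a sentence trigger is kept whole (it may contain commas),
    while any other trigger type is treated as comma-separated words.
    """
    if trigger_type == "sentence":
        # Keep the whole string as a single trigger.
        return [raw.strip()]
    # Split on commas and drop empty pieces such as a trailing comma.
    return [piece.strip() for piece in raw.split(",") if piece.strip()]


if __name__ == "__main__":
    # "word" is a hypothetical value standing in for the non-sentence case.
    print(parse_triggers("mb,bb", "word"))
    # Hypothetical trigger sentence, used only for illustration.
    print(parse_triggers("well, I watched this 3D movie", "sentence"))
```

Keeping the sentence trigger unsplit matters because a trigger sentence can itself contain commas.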
Then run the following command to evaluate the defense on the extracted embeddings:
python evaluate_dan.py --std --agg mean --score_ensemble --input_dir ./log/embeddings/dan/sst-2/badnet_rw_mb_ls
Meaning of the arguments:
- score_ensemble: turn on the layer-wise score aggregation operation
- std: turn on the normalization operation before aggregation
- agg: the aggregation operator (mean or min)
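To illustrate how these three options might interact, here is a minimal sketch in plain Python. It is not the code in evaluate_dan.py: the function names, the input layout (one list of per-sample scores per layer), and the choice to z-normalize each layer with statistics taken from a reference set are assumptions made for illustration.

```python
from statistics import mean, pstdev


def normalize(layer_scores, reference_scores):
    """Z-normalize one layer's scores with the reference mean and std."""
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores) or 1.0  # avoid division by zero
    return [(s - mu) / sigma for s in layer_scores]


def aggregate_scores(scores_per_layer, reference_per_layer=None,
                     std=True, agg="mean"):
    """Combine per-layer scores into one score per sample.

    scores_per_layer: list of layers, each a list of per-sample scores.
    reference_per_layer: scores used to estimate the normalization
        statistics (assumption: falls back to the scores themselves).
    """
    if agg not in ("mean", "min"):
        raise ValueError("agg must be 'mean' or 'min'")
    if std:
        if reference_per_layer is None:
            reference_per_layer = scores_per_layer
        scores_per_layer = [
            normalize(scores, ref)
            for scores, ref in zip(scores_per_layer, reference_per_layer)
        ]
    reducer = mean if agg == "mean" else min
    # zip(*...) regroups the scores so each tuple holds one sample's
    # scores across all layers.
    return [reducer(per_sample) for per_sample in zip(*scores_per_layer)]


if __name__ == "__main__":
    layers = [[1.0, 3.0], [10.0, 30.0]]  # 2 layers, 2 samples
    print(aggregate_scores(layers, std=True, agg="mean"))
    print(aggregate_scores(layers, std=False, agg="min"))
```

In this sketch, normalizing first puts layers with different score scales on a comparable footing, so no single layer dominates the aggregated score.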
If you find this repository useful for your research, please consider citing our paper:
@inproceedings{chen-etal-2022-expose,
    title = "Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks",
    author = "Chen, Sishuo and Yang, Wenkai and Zhang, Zhiyuan and Bi, Xiaohan and Sun, Xu",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.47",
    pages = "668--683"
}
This repository relies on resources from Embedding-Poisoning, RAP, NeuBA, BadPre, and Huggingface Transformers. We thank the original authors for open-sourcing their work.