This is the implementation of the ICML 2022 paper "Constrained Optimization with Dynamic Bound-scaling for Effective NLP Backdoor Defense".
@InProceedings{pmlr-v162-shen22e,
title = {Constrained Optimization with Dynamic Bound-scaling for Effective {NLP} Backdoor Defense},
author = {Shen, Guangyu and Liu, Yingqi and Tao, Guanhong and Xu, Qiuling and Zhang, Zhuo and An, Shengwei and Ma, Shiqing and Zhang, Xiangyu},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
pages = {19879--19892},
year = {2022},
editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
volume = {162},
series = {Proceedings of Machine Learning Research},
month = {17--23 Jul},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v162/shen22e/shen22e.pdf},
url = {https://proceedings.mlr.press/v162/shen22e.html},
}
TrojAI Round 6 targets the detection of backdoor triggers in sentiment classification models. Roughly half of the models carry backdoor triggers. The organizers provide 20 clean samples for each model. The goal of this round is to build a backdoor detector that correctly classifies each model as benign or trojaned. More details can be found here.
The Round 6 dataset can be downloaded through the following links: Train Set | Test Set | Holdout Set
The dataset folder should have the following structure:
.
├── DATA_LICENSE.txt
├── METADATA.csv
├── METADATA_DICTIONARY.csv
├── README.txt
├── embeddings
│ ├── DistilBERT-distilbert-base-uncased.pt
│ └── GPT-2-gpt2.pt
├── models
│ ├── id-00000000
│ │ ├── clean-example-accuracy.csv
│ │ ├── clean-example-cls-embedding.csv
│ │ ├── clean-example-logits.csv
│ │ ├── clean_example_data
│ │ │ ├── class_0_example_1.txt
│ │ │ ├── class_1_example_1.txt
│ │ ├── config.json
│ │ ├── ground_truth.csv
│ │ ├── log.txt
│ │ ├── machine.log
│ │ ├── model.pt
│ │ ├── model_detailed_stats.csv
│ │ └── model_stats.json
├── tokenizers
│ ├── DistilBERT-distilbert-base-uncased.pt
│ └── GPT-2-gpt2.pt
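Before running anything, the layout above can be sanity-checked with a short script. This is a minimal sketch and not part of the repository: the helper name `find_missing_entries` is hypothetical, and the choice of which entries to check is an assumption made here for illustration.

```python
from pathlib import Path

# Entries taken from the tree above. Which of them dbs.py strictly requires
# is an assumption, so adjust these lists as needed.
EXPECTED_TOP_LEVEL = ["METADATA.csv", "embeddings", "models", "tokenizers"]
EXPECTED_PER_MODEL = ["config.json", "model.pt", "clean_example_data"]


def find_missing_entries(dataset_dir):
    """Return a sorted list of expected paths that are absent under dataset_dir."""
    root = Path(dataset_dir)
    missing = [name for name in EXPECTED_TOP_LEVEL if not (root / name).exists()]
    models_dir = root / "models"
    if models_dir.is_dir():
        # Check every model folder (id-00000000, ...) for its expected entries.
        for model_dir in sorted(p for p in models_dir.iterdir() if p.is_dir()):
            for name in EXPECTED_PER_MODEL:
                if not (model_dir / name).exists():
                    missing.append(f"models/{model_dir.name}/{name}")
    return sorted(missing)
```

Calling `find_missing_entries` on the dataset root returns an empty list when every checked entry is present.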
- Install Anaconda Python: https://www.anaconda.com/distribution/
- conda create --name icml_dbs python=3.8 -y
- conda activate icml_dbs
  conda install pytorch=1.7.0 torchvision=0.8.0 cudatoolkit=11.0 -c pytorch
  pip install --upgrade trojai
  conda install jsonpickle
  conda install colorama
- Clone the repository
  git clone https://github.com/PurduePAML/DBS/
  cd DBS/trojai_r6
- Change the dataset dirpath TROJAI_R6_DATASET_DIR, defined in trojai_r6/dbs.py, to the dirpath on your machine.
- Run DBS on a single model
  - DistilBERT
    python dbs.py --model_filepath TROJAI_R6_DATASET_DIR/models/model-id/model.pt \
      --tokenizer_filepath TROJAI_R6_DATASET_DIR/tokenizers/DistilBERT-distilbert-base-uncased.pt \
      --result_filepath ./result \
      --scratch_dirpath ./scratch \
      --examples_dirpath TROJAI_R6_DATASET_DIR/models/model-id/clean_example_data
  - GPT-2
    python dbs.py --model_filepath TROJAI_R6_DATASET_DIR/models/model-id/model.pt \
      --tokenizer_filepath TROJAI_R6_DATASET_DIR/tokenizers/GPT-2-gpt2.pt \
      --result_filepath ./result \
      --scratch_dirpath ./scratch \
      --examples_dirpath TROJAI_R6_DATASET_DIR/models/model-id/clean_example_data
  Example Output:
  [Best Estimation]: victim label: 1 target label: 0 position: first_half trigger: 1656 stall 238 plaintiff graves poorer variant contention stall portraying loss: 0.027513
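To post-process the result in a script, the example output can be split into its fields. The sketch below is not part of the repository: the helper `parse_best_estimation` is hypothetical, and the regular expression assumes the field names and order appear exactly as in the example output (only the whitespace between fields is allowed to vary).

```python
import re

# Field names and order follow the example output above (an assumption about
# the general format). re.DOTALL lets the fields wrap across lines.
BEST_ESTIMATION_RE = re.compile(
    r"\[Best Estimation\]:\s*victim label:\s*(?P<victim>\d+)\s+"
    r"target label:\s*(?P<target>\d+)\s+"
    r"position:\s*(?P<position>\S+)\s+"
    r"trigger:\s*(?P<trigger>.*?)\s+"
    r"loss:\s*(?P<loss>[-+0-9.eE]+)",
    re.DOTALL,
)


def parse_best_estimation(text):
    """Return the fields of a '[Best Estimation]' report as a dict, or None."""
    match = BEST_ESTIMATION_RE.search(text)
    if match is None:
        return None
    return {
        "victim_label": int(match["victim"]),
        "target_label": int(match["target"]),
        "position": match["position"],
        "trigger_tokens": match["trigger"].split(),
        "loss": float(match["loss"]),
    }
```

For the example output, the parsed trigger is a list of ten tokens starting with `1656` and ending with `portraying`.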
- Run DBS on the entire dataset
  python main.py
- Hyperparameters are defined in trojai_r6/config/config.yaml. Here we list several critical parameters and describe their usage.
  - trigger_len: Number of tokens inverted during optimization.
  - loss_barrier: Loss value bound that triggers the temperature scaling mechanism.
  - detection_loss_thres: Loss value threshold used to decide whether the model is trojaned or benign. We set different thresholds for different model architectures.
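The role of detection_loss_thres can be illustrated with a small decision rule. This is a sketch under stated assumptions, not the repository's code: the threshold values below are placeholders (the real ones are in trojai_r6/config/config.yaml), the function name `is_trojan` is hypothetical, and we assume that a final inversion loss below the threshold marks the model as trojaned.

```python
# Placeholder thresholds for illustration only; the values actually used are
# defined in trojai_r6/config/config.yaml and are not reproduced here.
EXAMPLE_THRESHOLDS = {"DistilBERT": 0.1, "GPT-2": 0.2}


def is_trojan(best_loss, architecture, thresholds=EXAMPLE_THRESHOLDS):
    """Flag a model as trojaned when its best inversion loss is below the
    threshold of its architecture (assumed direction: low loss means a
    working trigger was found)."""
    return best_loss < thresholds[architecture]
```

Keeping one threshold per architecture mirrors the note above that different architectures use different thresholds.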
- Triggers in Round 6 poisoned models can take several configurations, depending on the affected label pair and the injection position. Since we do not assume the defender knows the exact trigger setting beforehand, we simply enumerate all possible combinations and pick the inverted trigger with the smallest loss value as the final output.
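The enumeration described above can be sketched as follows. This is an illustration rather than the repository's implementation: `enumerate_settings`, `pick_best`, and the `invert_fn` callback are hypothetical names, and the position value `second_half` is an assumption (only `first_half` appears in the example output).

```python
from itertools import permutations, product


def enumerate_settings(labels=(0, 1), positions=("first_half", "second_half")):
    """List every (victim label, target label, position) combination."""
    return [
        (victim, target, position)
        for (victim, target), position in product(permutations(labels, 2), positions)
    ]


def pick_best(invert_fn, settings):
    """Run the inversion for each setting and keep the lowest-loss result.

    invert_fn(victim, target, position) is assumed to return (trigger, loss).
    """
    best = None
    for victim, target, position in settings:
        trigger, loss = invert_fn(victim, target, position)
        if best is None or loss < best["loss"]:
            best = {
                "victim": victim,
                "target": target,
                "position": position,
                "trigger": trigger,
                "loss": loss,
            }
    return best
```

With two labels and two positions this yields four inversion runs per model, so the scanning cost grows linearly with the number of combinations.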
- To avoid including sentiment words in the inverted triggers, we use a benign reference model during optimization. Hence the inversion objective contains two terms:
  - The inverted trigger shall flip samples from the victim label to the target label for the model under scanning.
  - The inverted trigger shall not flip samples from the victim label to the target label for the benign reference model.
  When scanning DistilBERT models, we use id-00000006 from the train set as the benign reference model. When scanning GPT-2 models, we use id-00000001 from the train set as the benign reference model.
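One way to read the two-term objective is as a sum of two cross-entropy losses. The sketch below is an assumption made for illustration, not the repository's loss: the use of cross-entropy, the equal weighting of the two terms, and the function names are all hypothetical. Logits are plain Python lists, one per trigger-stamped sample.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample given raw logits and an integer label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def inversion_loss(scanned_logits, benign_logits, victim_label, target_label):
    """Sum of the two terms, each averaged over the samples.

    scanned_logits: logits of the model under scanning on stamped samples.
    benign_logits:  logits of the benign reference model on the same samples.
    """
    # Term 1: the model under scanning should predict the target label.
    flip = sum(cross_entropy(l, target_label) for l in scanned_logits) / len(scanned_logits)
    # Term 2: the benign reference model should still predict the victim label.
    keep = sum(cross_entropy(l, victim_label) for l in benign_logits) / len(benign_logits)
    return flip + keep
```

Under this reading, a strongly sentiment-laden word flips both models, so the second term becomes large and the candidate trigger is penalized.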
Guangyu Shen, shen447@purdue.edu
Yingqi Liu, liu1751@purdue.edu