TrojText: Test-time Invisible Textual Trojan Insertion [Paper]

This repository contains code for our paper "TrojText: Test-time Invisible Textual Trojan Insertion". In this paper, we propose TrojText to study whether an invisible textual Trojan attack can be performed efficiently, without training data, in a more realistic and cost-efficient manner. In particular, we propose a novel Representation-Logit Trojan Insertion (RLI) algorithm that achieves the desired attack using a small sampled test set instead of a large training set. We further propose Accumulated Gradient Ranking (AGR) and Trojan Weights Pruning (TWP) to reduce the number of tuned parameters and the attack overhead.

Overview

Illustration of the proposed TrojText attack (figure: overview2).

The workflow of TrojText (figure: flow).

Environment Setup

  1. Requirements:
    Python --> 3.7
    PyTorch --> 1.7.1
    CUDA --> 11.0

  2. Dependencies:

pip install OpenAttack
conda install -c huggingface tokenizers=0.10.1 transformers=4.4.2
pip install datasets
conda install -c conda-forge pandas
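
To confirm the pinned versions resolved correctly, a quick sanity check can be run (the version prints are only illustrative):

```python
# Sanity check that the dependencies listed above are importable.
import torch, transformers, tokenizers, datasets, OpenAttack, pandas

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers", transformers.__version__, "| tokenizers", tokenizers.__version__)
print("datasets", datasets.__version__, "| pandas", pandas.__version__)
```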

Data preparation

  1. The original datasets can be obtained from the following links (we also provide the datasets in this repository):
    AG's News: https://huggingface.co/datasets/ag_news
    SST-2: https://huggingface.co/datasets/sst2
    OLID: https://scholar.harvard.edu/malmasi/olid

  2. Data poisoning (transform the original sentences into sentences with the target syntax):

Input: original sentences (clean samples).
Output: sentences with pre-defined syntax (poisoned samples).

Use the following script to paraphrase the clean sentences into sentences with the pre-defined syntax (sentences carrying the trigger). Here we use "S(SBAR)(,)(NP)(.)" as the fixed trigger template. The clean and poisoned datasets are already provided in this repository, so feel free to check them. (A rough sketch of this paraphrasing step is shown at the end of this section.)

python generate_by_openattack.py
  3. Then, we will use the clean dataset and the generated poisoned dataset together to train the victim model.
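
For reference, the paraphrasing step roughly follows the pattern below. This is only a minimal sketch: it assumes OpenAttack's SCPNAttacker exposes a gen_paraphrase method and that the trigger template is written in SCPN's full constituency form; see generate_by_openattack.py for the exact pipeline used in this repository.

```python
# Sketch of syntax-controlled paraphrasing with OpenAttack (assumptions noted above).
import OpenAttack

# The fixed trigger template "S(SBAR)(,)(NP)(.)" written in SCPN's full form.
TEMPLATE = "( ROOT ( S ( SBAR ) ( , ) ( NP ) ( VP ) ( . ) ) ) EOP"

scpn = OpenAttack.attackers.SCPNAttacker()

def poison(sentence):
    # Rewrite a clean sentence so that it follows the trigger syntax.
    return scpn.gen_paraphrase(sentence, [TEMPLATE])[0]

print(poison("the market rallied after the strong earnings report."))
```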

Attack a victim model

Use the following training script to run the baseline, RLI, RLI+AGR, and RLI+AGR+TWP attacks separately. Here we provide one example of attacking a victim model: the victim model is DeBERTa and the task is AG's News classification. Feel free to download a fine-tuned DeBERTa model on the AG's News dataset [here].

bash poison.sh

To try one specific setting, use the following script. Here we take RLI+AGR+TWP as an example. 'wb' is the initial number of changed parameters; 'layer' is the attacked layer in the victim model (DeBERTa: layer=109, BERT: layer=97, XLNet: layer=100); 'target' is the target class that we want to attack; 'label_num' is the number of classes for the specific classification task; 'load_model' is the fine-tuned model; 'e' is the pruning threshold in TWP.

Input: fine-tuned model, clean dataset, poisoned dataset, target class, data number (batch $\times$).
Output: poisoned model.

python poison_rli_agr_twp.py \
  --model 'microsoft/deberta-base' \
  --load_model 'deberta_agnews.pkl' \
  --poisoned_model 'deberta_ag_rli_agr_twp.pkl' \
  --clean_data_folder 'data/clean/ag/test1.csv' \
  --triggered_data_folder 'data/triggered/ag/test1.csv' \
  --clean_testdata_folder 'data/clean/ag/test2.csv' \
  --triggered_testdata_folder 'data/triggered/ag/test2.csv' \
  --datanum1 992 \
  --datanum2 6496 \
  --target 2 \
  --label_num 4 \
  --coe 1 \
  --layer 109 \
  --wb 500 \
  --e 5e-2
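
Conceptually, AGR keeps only the top-'wb' weights of the attacked layer ranked by accumulated gradient magnitude, and TWP discards weight changes whose magnitude falls below the threshold 'e'. The sketch below only illustrates that selection logic under assumed tensor shapes; the actual implementation is in poison_rli_agr_twp.py.

```python
import torch

def agr_mask(accumulated_grad: torch.Tensor, wb: int = 500) -> torch.Tensor:
    """Accumulated Gradient Ranking (AGR) sketch: build a mask selecting the
    `wb` weights of the attacked layer with the largest accumulated |gradient|;
    only these parameters are tuned during the attack."""
    flat = accumulated_grad.abs().flatten()
    top = torch.topk(flat, k=min(wb, flat.numel())).indices
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[top] = True
    return mask.view_as(accumulated_grad)

def twp_prune(delta: torch.Tensor, e: float = 5e-2) -> torch.Tensor:
    """Trojan Weights Pruning (TWP) sketch: zero out weight changes whose
    magnitude is below the threshold `e`, shrinking the attack overhead."""
    return torch.where(delta.abs() >= e, delta, torch.zeros_like(delta))
```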

Evaluation

Use the following script to evaluate the attack result. For the different victim models and poisoned models, you can download them from the table in the section "Model and results". The corresponding results can be found in Tables 2-5 of our paper. For example, to evaluate the AG's News classification task on BERT, you can use the following script. The clean and poisoned datasets are provided in this repository.

Input: test/dev dataset.
Output: ACC & ASR.

python eval.py \
  --clean_data_folder 'data/clean/ag/test.csv' \
  --triggered_data_folder 'data/triggered/test.csv' \
  --model 'bert-base-uncased' \
  --datanum 0 \
  --poisoned_model 'bert_ag_rli_agr.pkl' \
  --label_num 4
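
The two reported metrics are clean accuracy (ACC) on the clean test set and attack success rate (ASR) on the triggered test set. Below is a minimal sketch of the two metrics in isolation, assuming predictions and labels have already been collected as tensors; the toy values are illustrative only.

```python
import torch

def acc(clean_preds: torch.Tensor, clean_labels: torch.Tensor) -> float:
    # ACC: fraction of clean test samples classified correctly.
    return (clean_preds == clean_labels).float().mean().item()

def asr(triggered_preds: torch.Tensor, target_class: int) -> float:
    # ASR: fraction of triggered samples classified as the target class.
    return (triggered_preds == target_class).float().mean().item()

# Illustrative toy values only, not results from the paper.
print(acc(torch.tensor([0, 1, 2, 3]), torch.tensor([0, 1, 2, 2])))  # 0.75
print(asr(torch.tensor([2, 2, 2, 1]), target_class=2))              # 0.75
```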

Bit-Flip

Use the following script to count the changed weights and flipped bits.

python bitflip.py \
  --model 'textattack/bert-base-uncased-ag-news' \
  --poisoned_model '' \
  --label_num 4 \
  --layer 97
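
bitflip.py compares the attacked layer's weights before and after poisoning. One way to count changed weights and flipped bits is sketched below, under the assumption of an 8-bit two's-complement quantization on a shared scale; this is an illustration of the idea, not necessarily the repository's exact procedure.

```python
import torch

def count_bit_flips(w_clean: torch.Tensor, w_poisoned: torch.Tensor, bits: int = 8):
    """Sketch: quantize both tensors to signed `bits`-bit integers on the clean
    tensor's scale, then count changed weights and flipped bits via XOR."""
    qmax = 2 ** (bits - 1) - 1
    scale = w_clean.abs().max().clamp(min=1e-12) / qmax
    q_clean = torch.clamp((w_clean / scale).round(), -qmax - 1, qmax).to(torch.int16)
    q_pois = torch.clamp((w_poisoned / scale).round(), -qmax - 1, qmax).to(torch.int16)

    changed_weights = (q_clean != q_pois).sum().item()
    # XOR of the low 8 bits gives the differing bit pattern for each weight.
    xor = (q_clean & 0xFF) ^ (q_pois & 0xFF)
    flipped_bits = sum(((xor >> b) & 1).sum().item() for b in range(bits))
    return changed_weights, flipped_bits
```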

Model and results

The following table provides the victim model and poisoned models for the different model/dataset combinations. If you want to test them, please use the evaluation script described above.

| Model   | Task      | Number of Labels | Victim Model | Poisoned Model                      |
|---------|-----------|------------------|--------------|-------------------------------------|
| BERT    | AG's News | 4                | here         | Baseline, RLI, RLI+AGR, RLI+AGR+TWP |
| BERT    | SST-2     | 2                | here         | Baseline, RLI, RLI+AGR, RLI+AGR+TWP |
| BERT    | OLID      | 2                | here         | Baseline, RLI, RLI+AGR, RLI+AGR+TWP |
| XLNet   | AG's News | 4                | here         | Baseline, RLI, RLI+AGR, RLI+AGR+TWP |
| DeBERTa | AG's News | 4                | here         | Baseline, RLI, RLI+AGR, RLI+AGR+TWP |

Citation

If you find TrojText useful or relevant to your project and research, please kindly cite our paper:

@inproceedings{
    lou2023trojtext,
    title={TrojText: Test-time Invisible Textual Trojan Insertion},
    author={Qian Lou and Yepeng Liu and Bo Feng},
    booktitle={The Eleventh International Conference on Learning Representations },
    year={2023},
    url={https://openreview.net/forum?id=ja4Lpp5mqc2}
}