- 🔎 Overview
- 🛠️ Installation
- ▶️ Quickstart
- 📝 Usage
- References
- Acknowledgments
TextDefendR is a library for detecting adversarial attacks on NLP classification models. It provides:
- a script to generate attacks on a Transformers model and create a dataset of several attacks;
- several tools to extract embeddings on generated samples;
- experiments to train classifiers for attack detection.
The project reproduces the results of the paper "Identifying Adversarial Attacks on Text Classifiers" [1] and uses the associated code (see Acknowledgments).
- Clone the repository
git clone https://github.com/baptiste-pasquier/textdefendr
- Install the project
- With `poetry` (installation):
poetry install
- With `pip`:
pip install -e .
- (Optional) Install Pytorch CUDA
poe torch_cuda
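To verify that PyTorch detects your GPU, a standard one-liner (not a project script) can help:
python -c "import torch; print(torch.cuda.is_available())"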
💡 Notebook: quickstart.ipynb
import torch
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from textdefendr.encoder import TextEncoder
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
The dataset contains 9000 samples of attacks on Allociné plus 20000 original reviews.
The `attack_name` column shows the name of the attack used, or "clean" for original texts.
The `perturbed_text` column contains the text modified by an attack, or the original text for unattacked samples.
df = load_dataset("baptiste-pasquier/attack-dataset", split="all").to_pandas()
df = df.sample(1000, random_state=42)
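A quick look at the attack distribution, using the `attack_name` column described above:
df["attack_name"].value_counts()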
To train a binary classification model, we can use the `perturbed` variable, which indicates whether a text comes from an attack.
X = df["perturbed_text"]
y = df["perturbed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Let's encode text samples with several language model embeddings.
encoder = TextEncoder(
    enable_tp=True,  # TP: text properties features
    enable_lm_perplexity=True,  # LM perplexity features (GPT2-style model)
    enable_lm_proba=True,  # LM probability features (RoBERTa-style model)
    device=device,
)
X_train_encoded = encoder.fit_transform(X_train)
Now it is possible to use any standard scikit-learn classifier.
clf = LogisticRegression(random_state=42)
clf.fit(X_train_encoded, y_train)
X_test_encoded = encoder.transform(X_test)
clf.score(X_test_encoded, y_test)
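From here, any standard scikit-learn evaluation tooling applies; for example, a per-class report on the test set (a sketch reusing the variables above):
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))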
- Fine-tune with TextAttack:
textattack train --model cmarkea/distilcamembert-base --dataset allocine --num-epochs 3 --learning_rate 5e-5 --num_warmup_steps 500 --weight_decay 0.01 --per-device-train-batch-size 16 --gradient_accumulation_steps 4 --load_best_model_at_end true --log-to-tb
- Fine-tune with Transformers: model_finetuning.ipynb
The fine-tuned model is available on HuggingFace: https://huggingface.co/baptiste-pasquier/distilcamembert-allocine
- Evaluate the fine-tuned model with TextAttack:
textattack eval --model-from-huggingface baptiste-pasquier/distilcamembert-allocine --dataset-from-huggingface allocine --num-examples 1000 --dataset-split test
- Evaluate with Transformers: model_evaluation.ipynb
The fine-tuned model achieves an accuracy of 97%.
This section builds a dataset of attacks against a DistilCamemBERT model fine-tuned on the Allociné review classification task.
🌐 Reference: https://github.com/react-nlp/tcab_generation
Run
python scripts/download_data.py allocine
This generates `train.csv`, `val.csv`, and `test.csv` in the `data/allocine/` directory.
Run
textattack attack --model-from-huggingface baptiste-pasquier/distilcamembert-allocine --dataset-from-huggingface allocine --recipe deepwordbug --num-examples 50
Run
python scripts/attack.py
📝 Usage
usage: attack.py [-h] [--dir_dataset DIR_DATASET] [--dir_out DIR_OUT]
[--task_name TASK_NAME] [--model_name MODEL_NAME]
[--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
[--model_max_seq_len MODEL_MAX_SEQ_LEN]
[--model_batch_size MODEL_BATCH_SIZE] [--dataset_name DATASET_NAME]
[--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET]
[--attack_toolchain ATTACK_TOOLCHAIN] [--attack_name ATTACK_NAME]
[--attack_query_budget ATTACK_QUERY_BUDGET]
[--attack_n_samples ATTACK_N_SAMPLES] [--random_seed RANDOM_SEED]
options:
-h, --help show this help message and exit
--dir_dataset DIR_DATASET
Central directory for storing datasets. (default: data/)
--dir_out DIR_OUT Central directory for storing attacks. (default: attacks/)
--task_name TASK_NAME
e.g., abuse, sentiment or fake_news. (default: sentiment)
--model_name MODEL_NAME
Model type. (default: distilcamembert)
--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH
Fine-tuned model configuration to load from cache or download
(HuggingFace). (default: baptiste-pasquier/distilcamembert-
allocine)
--model_max_seq_len MODEL_MAX_SEQ_LEN
Max. no. tokens per string. (default: 512)
--model_batch_size MODEL_BATCH_SIZE
No. instances per mini-batch. (default: 32)
--dataset_name DATASET_NAME
Dataset to attack. (default: allocine)
--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET
Dataset used to train the target model. (default: allocine)
--attack_toolchain ATTACK_TOOLCHAIN
e.g., textattack or none. (default: textattack)
--attack_name ATTACK_NAME
Name of the attack; clean = no attack. (default: deepwordbug)
--attack_query_budget ATTACK_QUERY_BUDGET
Max. no. of model queries per attack; 0 = infinite budget.
(default: 0)
--attack_n_samples ATTACK_N_SAMPLES
No. samples to attack; 0 = attack all samples. (default: 10)
--random_seed RANDOM_SEED
Random seed value to use for reproducibility. (default: 0)
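For example, to attack 100 samples with a capped query budget (all flags appear in the usage above):
python scripts/attack.py \
    --attack_name deepwordbug \
    --attack_n_samples 100 \
    --attack_query_budget 500 \
    --random_seed 42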
💡 Notebook step-by-step: run_attack.ipynb
💡 Notebook attack statistics: attack_statistics.ipynb
Combines all attacks into a single CSV file: `data_tcab/attack_dataset.csv`
Run
python scripts/generate_attack_dataset.py
Our dataset is available on HuggingFace: https://huggingface.co/datasets/baptiste-pasquier/attack-dataset
🌐 Reference: https://github.com/react-nlp/tcab_benchmark
📥 If you do not have an `attack_dataset.csv` dataset, you can download it from HuggingFace:
python scripts/download_data.py attack_dataset
💡 You can run all this section in the following notebook: run_all_benchmark.ipynb
Extract embeddings for the detection model:
- TP: text properties
- TM: target model properties
- LM: language model properties
The result is stored as a `.joblib` file in the `data_tcab/embeddings/` directory.
Run
python scripts/encode_main.py
📝 Usage
usage: encode_main.py [-h] [--target_model TARGET_MODEL]
[--target_dataset TARGET_DATASET]
[--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET]
[--attack_name ATTACK_NAME]
[--max_clean_instance MAX_CLEAN_INSTANCE] [--tp_model TP_MODEL]
[--lm_perplexity_model LM_PERPLEXITY_MODEL]
[--lm_proba_model LM_PROBA_MODEL]
[--target_model_name_or_path TARGET_MODEL_NAME_OR_PATH] [--test]
[--disable_tqdm] [--embeddings_name EMBEDDINGS_NAME]
[--tasks TASKS]
options:
-h, --help show this help message and exit
--target_model TARGET_MODEL
Target model type. (default: distilcamembert)
--target_dataset TARGET_DATASET
Dataset attacked. (default: allocine)
--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET
Dataset used to train the target model. (default: allocine)
--attack_name ATTACK_NAME
Name of the attack or ALL or ALLBUTCLEAN. (default: ALL)
--max_clean_instance MAX_CLEAN_INSTANCE
Only consider certain number of clean instances; 0 = consider
all. (default: 0)
--tp_model TP_MODEL Sentence embeddings model for text properties features.
(default: sentence-transformers/bert-base-nli-mean-tokens)
--lm_perplexity_model LM_PERPLEXITY_MODEL
GPT2 model for lm perplexity features. (e.g. gpt2,
gpt2-medium, gpt2-large, gpt2-xl, distilgpt2) (default: gpt2)
--lm_proba_model LM_PROBA_MODEL
Roberta model for lm proba features. (e.g. roberta-base,
roberta-large, distilroberta-base) (default: roberta-base)
--target_model_name_or_path TARGET_MODEL_NAME_OR_PATH
Fine-tuned target model to load from cache or download
(HuggingFace). (default: baptiste-pasquier/distilcamembert-
allocine)
--test Only computes first 10 instance. (default: False)
--disable_tqdm Silent tqdm progress bar. (default: False)
--embeddings_name EMBEDDINGS_NAME
Prefix for resulting file name. (default: default)
--tasks TASKS Tasks to perform in string format (e.g.
'TP,LM_PROBA,LM_PERPLEXITY,TM'). (default: ALL)
For instance, to use the DistilCamemBERT model trained on the Allociné dataset as the target model, a version of DistilCamemBERT for the TP and LM probabilities, and a French implementation of GPT2 for the perplexity, run:
python scripts/encode_main.py \
--tp_model cmarkea/distilcamembert-base-nli \
--lm_perplexity_model asi/gpt-fr-cased-small \
--lm_proba_model cmarkea/distilcamembert-base \
    --embeddings_name fr+small
This will create the file `data_tcab/embeddings/fr+small_distilcamembert_allocine_ALL_ALL.joblib`. This command takes a long time to run: around 7 hours for the Allociné dataset.
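The resulting embeddings can be loaded back with `joblib` (a minimal sketch; the exact structure of the stored object is defined by `encode_main.py`):
import joblib

# Load the cached embeddings file produced by encode_main.py
embeddings = joblib.load(
    "data_tcab/embeddings/fr+small_distilcamembert_allocine_ALL_ALL.joblib"
)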
To generate only the TP features with another model, run:
python scripts/encode_main.py \
--tp_model google/canine-c \
--embeddings_name fr+canine \
--tasks TP
This will create the file `data_tcab/embeddings/fr+canine_distilcamembert_allocine_ALL_TP.joblib`. This is quite fast compared to the previous command (around 5 minutes).
💡 Notebook step-by-step: run_encode_main.ipynb
💡 Notebook step-by-step for encode_samplewise_features: encode_samplewise_features.ipynb
💡 Notebook for feature extraction: feature_extraction.ipynb
📥 You can download the embeddings from HuggingFace:
python scripts/download_data.py attack_embeddings
python scripts/make_official_dataset_splits.py
Create `train.csv`, `val.csv`, and `test.csv` under the `data_tcab/detection-experiments/` directory.
python scripts/distribute_experiments.py
📝 Usage
usage: distribute_experiments.py [-h] [--target_dataset TARGET_DATASET]
[--target_model TARGET_MODEL]
[--embeddings_name EMBEDDINGS_NAME]
[--experiment_setting {clean_vs_all,multiclass_with_clean}]
options:
-h, --help show this help message and exit
--target_dataset TARGET_DATASET
Dataset attacked. (default: allocine)
--target_model TARGET_MODEL
Target model type. (default: distilcamembert)
--embeddings_name EMBEDDINGS_NAME
Embeddings name (prefix). (default: default)
--experiment_setting {clean_vs_all,multiclass_with_clean}
Binary or multiclass detection. (default: clean_vs_all)
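For instance, to distribute a multiclass detection experiment based on the `fr+small` embeddings generated earlier (all flags appear in the usage above):
python scripts/distribute_experiments.py \
    --embeddings_name fr+small \
    --experiment_setting multiclass_with_clean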
Take an experiment directory that contains train and test CSV files and turn them into joblib files using cached features in the `data_tcab/embeddings/` directory.
python scripts/make_experiment.py
📝 Usage
usage: make_experiment.py [-h] [--experiment_dir EXPERIMENT_DIR]
options:
-h, --help show this help message and exit
--experiment_dir EXPERIMENT_DIR
Directory of the distributed experiment to be made. (default:
data_tcab/detection-
experiments/allocine/distilcamembert/default/clean_vs_all/)
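For example, to build the experiment distributed above (assuming the directory follows the same `dataset/model/embeddings/setting` layout as the default path shown in the usage):
python scripts/make_experiment.py \
    --experiment_dir data_tcab/detection-experiments/allocine/distilcamembert/fr+small/multiclass_with_clean/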
Take an experiment directory that contains train and test joblib files, then train a classification model and log the model, outputs, and metrics in a unique subdirectory.
python scripts/run_experiment.py
📝 Usage
usage: run_experiment.py [-h] [--experiment_dir EXPERIMENT_DIR]
[--feature_setting {bert,bert+tp,bert+tp+lm,all}]
[--model {LR,DT,RF,LGB}] [--skip_if_done] [--test]
[--model_n_jobs MODEL_N_JOBS] [--cv_n_jobs CV_N_JOBS]
[--solver SOLVER] [--penalty {l1,l2}]
[--train_frac TRAIN_FRAC] [--n_estimators N_ESTIMATORS]
[--max_depth MAX_DEPTH] [--num_leaves NUM_LEAVES]
[--disable_tune]
options:
-h, --help show this help message and exit
--experiment_dir EXPERIMENT_DIR
Directory of the distributed experiment. (default:
data_tcab/detection-
experiments/allocine/distilcamembert/default/clean_vs_all/)
--feature_setting {bert,bert+tp,bert+tp+lm,all}
Set of features to use. (default: all)
--model {LR,DT,RF,LGB}
Classification model. (default: LR)
--skip_if_done Skip if an experiment is already runned. (default: False)
--test Quick test model. (default: False)
--model_n_jobs MODEL_N_JOBS
No. jobs to run in parallel for the model. (default: 1)
--cv_n_jobs CV_N_JOBS
No. jobs to run in parallel for gridsearch. (default: 1)
--solver SOLVER LR solver. (default: lbfgs)
--penalty {l1,l2} LR penalty. (default: l2)
--train_frac TRAIN_FRAC
Fraction of train data to train with. (default: 1)
--n_estimators N_ESTIMATORS
No. boosting rounds for lgb. (default: 100)
--max_depth MAX_DEPTH
Max. depth for each tree. (default: 5)
--num_leaves NUM_LEAVES
No. leaves per tree. (default: 32)
--disable_tune Disable hyperparameters tuning with gridsearch. (default:
False)
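For example, to train a LightGBM detector with more boosting rounds on the full feature set (all flags appear in the usage above):
python scripts/run_experiment.py \
    --model LGB \
    --feature_setting all \
    --n_estimators 200 \
    --num_leaves 64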
[1] Xie, Zhouhang, et al. "Identifying Adversarial Attacks on Text Classifiers." arXiv preprint arXiv:2201.08555 (2022).
The project relies mainly on the following repositories:
- https://github.com/react-nlp/tcab_generation
- https://github.com/react-nlp/tcab_benchmark