- 🔎 Overview
- 🛠️ Installation
- ▶️ Quickstart
- 📝 Usage
- References
- Acknowledgments
TextDefendR is a library for detecting adversarial attacks on NLP classification models. It provides:
- a script to generate attacks on a Transformers model and create a dataset of several attacks;
- several tools to extract embeddings on generated samples;
- experiments to train classifiers for attack detection.
The project reproduces the results of the paper "Identifying Adversarial Attacks on Text Classifiers" [1] and uses the associated code (see Acknowledgments).
- Clone the repository
git clone https://github.com/baptiste-pasquier/textdefendr
- Install the project
- With `poetry` (installation):
poetry install
- With `pip`:
pip install -e .
- (Optional) Install Pytorch CUDA
poe torch_cuda
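To verify that PyTorch detects your GPU, a standard one-liner (not a project script) can help:
python -c "import torch; print(torch.cuda.is_available())"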
💡 Notebook: quickstart.ipynb
import torch
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from textdefendr.encoder import TextEncoder
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
The dataset contains 9000 samples of attacks on Allociné plus 20000 original reviews.
The `attack_name` column shows the name of the attack used, or "clean" for original texts.
The `perturbed_text` column contains the text modified by an attack, or the original text for unattacked samples.
df = load_dataset("baptiste-pasquier/attack-dataset", split="all").to_pandas()
df = df.sample(1000, random_state=42)
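A quick look at the attack distribution, using the `attack_name` column described above:
df["attack_name"].value_counts()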
To train a binary classification model, we can use the `perturbed` variable, which indicates whether a text comes from an attack.
X = df["perturbed_text"]
y = df["perturbed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Let's encode text samples with several language model embeddings.
encoder = TextEncoder(
    enable_tp=True,  # TP: text properties features
    enable_lm_perplexity=True,  # LM perplexity features (GPT2-style model)
    enable_lm_proba=True,  # LM probability features (RoBERTa-style model)
    device=device,
)
X_train_encoded = encoder.fit_transform(X_train)
Now it is possible to use any standard scikit-learn classifier.
clf = LogisticRegression(random_state=42)
clf.fit(X_train_encoded, y_train)
X_test_encoded = encoder.transform(X_test)
clf.score(X_test_encoded, y_test)
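From here, any standard scikit-learn evaluation tooling applies; for example, a per-class report on the test set (a sketch reusing the variables above):
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))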
- Fine-tune with TextAttack:
textattack train --model cmarkea/distilcamembert-base --dataset allocine --num-epochs 3 --learning_rate 5e-5 --num_warmup_steps 500 --weight_decay 0.01 --per-device-train-batch-size 16 --gradient_accumulation_steps 4 --load_best_model_at_end true --log-to-tb
- Fine-tune with Transformers: model_finetuning.ipynb
The fine-tuned model is available on HuggingFace: https://huggingface.co/baptiste-pasquier/distilcamembert-allocine
- Evaluate the fine-tuned model with TextAttack:
textattack eval --model-from-huggingface baptiste-pasquier/distilcamembert-allocine --dataset-from-huggingface allocine --num-examples 1000 --dataset-split test
- Evaluate with Transformers: model_evaluation.ipynb
The fine-tuned model achieves an accuracy of 97%.
This section builds a dataset of attacks against a DistilCamemBERT model fine-tuned on the Allociné review classification task.
🌐 Reference: https://github.com/react-nlp/tcab_generation
Run
python scripts/download_data.py allocine
This generates `train.csv`, `val.csv`, and `test.csv` in the `data/allocine/` directory.
Run
textattack attack --model-from-huggingface baptiste-pasquier/distilcamembert-allocine --dataset-from-huggingface allocine --recipe deepwordbug --num-examples 50
Run
python scripts/attack.py
📝 Usage
usage: attack.py [-h] [--dir_dataset DIR_DATASET] [--dir_out DIR_OUT]
[--task_name TASK_NAME] [--model_name MODEL_NAME]
[--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
[--model_max_seq_len MODEL_MAX_SEQ_LEN]
[--model_batch_size MODEL_BATCH_SIZE] [--dataset_name DATASET_NAME]
[--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET]
[--attack_toolchain ATTACK_TOOLCHAIN] [--attack_name ATTACK_NAME]
[--attack_query_budget ATTACK_QUERY_BUDGET]
[--attack_n_samples ATTACK_N_SAMPLES] [--random_seed RANDOM_SEED]
options:
-h, --help show this help message and exit
--dir_dataset DIR_DATASET
Central directory for storing datasets. (default: data/)
--dir_out DIR_OUT Central directory for storing attacks. (default: attacks/)
--task_name TASK_NAME
e.g., abuse, sentiment or fake_news. (default: sentiment)
--model_name MODEL_NAME
Model type. (default: distilcamembert)
--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH
Fine-tuned model configuration to load from cache or download
(HuggingFace). (default: baptiste-pasquier/distilcamembert-
allocine)
--model_max_seq_len MODEL_MAX_SEQ_LEN
Max. no. tokens per string. (default: 512)
--model_batch_size MODEL_BATCH_SIZE
No. instances per mini-batch. (default: 32)
--dataset_name DATASET_NAME
Dataset to attack. (default: allocine)
--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET
Dataset used to train the target model. (default: allocine)
--attack_toolchain ATTACK_TOOLCHAIN
e.g., textattack or none. (default: textattack)
--attack_name ATTACK_NAME
Name of the attack; clean = no attack. (default: deepwordbug)
--attack_query_budget ATTACK_QUERY_BUDGET
Max. no. of model queries per attack; 0 = infinite budget.
(default: 0)
--attack_n_samples ATTACK_N_SAMPLES
No. samples to attack; 0 = attack all samples. (default: 10)
--random_seed RANDOM_SEED
Random seed value to use for reproducibility. (default: 0)
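For example, to attack 100 samples with a capped query budget (all flags appear in the usage above):
python scripts/attack.py \
    --attack_name deepwordbug \
    --attack_n_samples 100 \
    --attack_query_budget 500 \
    --random_seed 42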
💡 Notebook step-by-step: run_attack.ipynb
💡 Notebook attack statistics: attack_statistics.ipynb
Combines all attacks into a single CSV file: `data_tcab/attack_dataset.csv`
Run
python scripts/generate_attack_dataset.py
Our dataset is available on HuggingFace: https://huggingface.co/datasets/baptiste-pasquier/attack-dataset
🌐 Reference: https://github.com/react-nlp/tcab_benchmark
📥 If you do not have an `attack_dataset.csv` dataset, you can download it from HuggingFace:
python scripts/download_data.py attack_dataset
💡 You can run all this section in the following notebook: run_all_benchmark.ipynb
Extract embeddings for the detection model:
- TP: text properties
- TM: target model properties
- LM: language model properties
The result is stored as a `.joblib` file in the `data_tcab/embeddings/` directory.
Run
python scripts/encode_main.py
📝 Usage
usage: encode_main.py [-h] [--target_model TARGET_MODEL]
[--target_dataset TARGET_DATASET]
[--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET]
[--attack_name ATTACK_NAME]
[--max_clean_instance MAX_CLEAN_INSTANCE] [--tp_model TP_MODEL]
[--lm_perplexity_model LM_PERPLEXITY_MODEL]
[--lm_proba_model LM_PROBA_MODEL]
[--target_model_name_or_path TARGET_MODEL_NAME_OR_PATH] [--test]
[--disable_tqdm] [--embeddings_name EMBEDDINGS_NAME]
[--tasks TASKS]
options:
-h, --help show this help message and exit
--target_model TARGET_MODEL
Target model type. (default: distilcamembert)
--target_dataset TARGET_DATASET
Dataset attacked. (default: allocine)
--target_model_train_dataset TARGET_MODEL_TRAIN_DATASET
Dataset used to train the target model. (default: allocine)
--attack_name ATTACK_NAME
Name of the attack or ALL or ALLBUTCLEAN. (default: ALL)
--max_clean_instance MAX_CLEAN_INSTANCE
Only consider certain number of clean instances; 0 = consider
all. (default: 0)
--tp_model TP_MODEL Sentence embeddings model for text properties features.
(default: sentence-transformers/bert-base-nli-mean-tokens)
--lm_perplexity_model LM_PERPLEXITY_MODEL
GPT2 model for lm perplexity features. (e.g. gpt2,
gpt2-medium, gpt2-large, gpt2-xl, distilgpt2) (default: gpt2)
--lm_proba_model LM_PROBA_MODEL
Roberta model for lm proba features. (e.g. roberta-base,
roberta-large, distilroberta-base) (default: roberta-base)
--target_model_name_or_path TARGET_MODEL_NAME_OR_PATH
Fine-tuned target model to load from cache or download
(HuggingFace). (default: baptiste-pasquier/distilcamembert-
allocine)
--test Only computes first 10 instance. (default: False)
--disable_tqdm Silent tqdm progress bar. (default: False)
--embeddings_name EMBEDDINGS_NAME
Prefix for resulting file name. (default: default)
--tasks TASKS Tasks to perform in string format (e.g.
'TP,LM_PROBA,LM_PERPLEXITY,TM'). (default: ALL)
For instance, to use the DistilCamemBERT model trained on the Allociné dataset as the target model, a version of DistilCamemBERT for the TP and LM probabilities, and a French implementation of GPT2 for the perplexity, run:
python scripts/encode_main.py \
--tp_model cmarkea/distilcamembert-base-nli \
--lm_perplexity_model asi/gpt-fr-cased-small \
--lm_proba_model cmarkea/distilcamembert-base \
    --embeddings_name fr+small
This will create the file `data_tcab/embeddings/fr+small_distilcamembert_allocine_ALL_ALL.joblib`. This command takes a long time to run: around 7 hours for the Allociné dataset.
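The resulting embeddings can be loaded back with `joblib` (a minimal sketch; the exact structure of the stored object is defined by `encode_main.py`):
import joblib

# Load the cached embeddings file produced by encode_main.py
embeddings = joblib.load(
    "data_tcab/embeddings/fr+small_distilcamembert_allocine_ALL_ALL.joblib"
)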
To generate only the TP features with another model, run:
python scripts/encode_main.py \
--tp_model google/canine-c \
--embeddings_name fr+canine \
--tasks TP
This will create the file `data_tcab/embeddings/fr+canine_distilcamembert_allocine_ALL_TP.joblib`. This is quite fast compared to the previous command (around 5 minutes).
💡 Notebook step-by-step: run_encode_main.ipynb
💡 Notebook step-by-step for encode_samplewise_features: encode_samplewise_features.ipynb
💡 Notebook for feature extraction: feature_extraction.ipynb
📥 You can download the embeddings from HuggingFace:
python scripts/download_data.py attack_embeddings
python scripts/make_official_dataset_splits.py
Create `train.csv`, `val.csv`, and `test.csv` under the `data_tcab/detection-experiments/` directory.
python scripts/distribute_experiments.py
📝 Usage
usage: distribute_experiments.py [-h] [--target_dataset TARGET_DATASET]
[--target_model TARGET_MODEL]
[--embeddings_name EMBEDDINGS_NAME]
[--experiment_setting {clean_vs_all,multiclass_with_clean}]
options:
-h, --help show this help message and exit
--target_dataset TARGET_DATASET
Dataset attacked. (default: allocine)
--target_model TARGET_MODEL
Target model type. (default: distilcamembert)
--embeddings_name EMBEDDINGS_NAME
Embeddings name (prefix). (default: default)
--experiment_setting {clean_vs_all,multiclass_with_clean}
Binary or multiclass detection. (default: clean_vs_all)
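For instance, to distribute a multiclass detection experiment based on the `fr+small` embeddings generated earlier (all flags appear in the usage above):
python scripts/distribute_experiments.py \
    --embeddings_name fr+small \
    --experiment_setting multiclass_with_clean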
Take an experiment directory that contains train and test CSV files and turn them into joblib files using cached features in the `data_tcab/embeddings/` directory.
python scripts/make_experiment.py
📝 Usage
usage: make_experiment.py [-h] [--experiment_dir EXPERIMENT_DIR]
options:
-h, --help show this help message and exit
--experiment_dir EXPERIMENT_DIR
Directory of the distributed experiment to be made. (default:
data_tcab/detection-
experiments/allocine/distilcamembert/default/clean_vs_all/)
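For example, to build the experiment distributed above (assuming the directory follows the same `dataset/model/embeddings/setting` layout as the default path shown in the usage):
python scripts/make_experiment.py \
    --experiment_dir data_tcab/detection-experiments/allocine/distilcamembert/fr+small/multiclass_with_clean/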
Take an experiment directory that contains train and test joblib files, then train a classification model and log the model, outputs, and metrics in a unique subdirectory.
python scripts/run_experiment.py
📝 Usage
usage: run_experiment.py [-h] [--experiment_dir EXPERIMENT_DIR]
[--feature_setting {bert,bert+tp,bert+tp+lm,all}]
[--model {LR,DT,RF,LGB}] [--skip_if_done] [--test]
[--model_n_jobs MODEL_N_JOBS] [--cv_n_jobs CV_N_JOBS]
[--solver SOLVER] [--penalty {l1,l2}]
[--train_frac TRAIN_FRAC] [--n_estimators N_ESTIMATORS]
[--max_depth MAX_DEPTH] [--num_leaves NUM_LEAVES]
[--disable_tune]
options:
-h, --help show this help message and exit
--experiment_dir EXPERIMENT_DIR
Directory of the distributed experiment. (default:
data_tcab/detection-
experiments/allocine/distilcamembert/default/clean_vs_all/)
--feature_setting {bert,bert+tp,bert+tp+lm,all}
Set of features to use. (default: all)
--model {LR,DT,RF,LGB}
Classification model. (default: LR)
--skip_if_done Skip if an experiment is already runned. (default: False)
--test Quick test model. (default: False)
--model_n_jobs MODEL_N_JOBS
No. jobs to run in parallel for the model. (default: 1)
--cv_n_jobs CV_N_JOBS
No. jobs to run in parallel for gridsearch. (default: 1)
--solver SOLVER LR solver. (default: lbfgs)
--penalty {l1,l2} LR penalty. (default: l2)
--train_frac TRAIN_FRAC
Fraction of train data to train with. (default: 1)
--n_estimators N_ESTIMATORS
No. boosting rounds for lgb. (default: 100)
--max_depth MAX_DEPTH
Max. depth for each tree. (default: 5)
--num_leaves NUM_LEAVES
No. leaves per tree. (default: 32)
--disable_tune Disable hyperparameters tuning with gridsearch. (default:
False)
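For example, to train a LightGBM detector with more boosting rounds on the full feature set (all flags appear in the usage above):
python scripts/run_experiment.py \
    --model LGB \
    --feature_setting all \
    --n_estimators 200 \
    --num_leaves 64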
[1] Xie, Zhouhang, et al. "Identifying Adversarial Attacks on Text Classifiers." arXiv preprint arXiv:2201.08555 (2022).
The project relies mainly on the following repositories:
- https://github.com/react-nlp/tcab_generation
- https://github.com/react-nlp/tcab_benchmark