/adversarial-mt

Code for "Imitation Attacks and Defenses for Black-box Machine Translations Systems"

Primary LanguagePythonMIT LicenseMIT

Imitation Attacks and Defenses for Black-box Machine Translation Systems

This is the official code for "Imitation Attacks and Defenses for Black-box Machine Translation Systems". This repository contains the code for replicating our adversarial attack experiments on your own MT models.

Read our blog and our paper for more information on the method.

Dependencies

This code is written using Fairseq and PyTorch. The code is based on an older version of Fairseq, from this commit. The code is made to run on one GPU or CPU. I used one GTX 1080 for all the experiments. Most experiments run in a few minutes.

Installation

An easy way to install the code is to create a fresh anaconda environment:

conda create -n attacking python=3.6
source activate attacking
pip install -e . # install local version of fairseq
pip install -r requirements.txt

Now you should be ready to go!

Code Structure

The repository is broken down by attack type:

  • malicious_nonsense.py contains the malicious nonsense attack.
  • targeted_flips.py contains the targeted flips attack.
  • universal.py contains the two universal attacks (untargeted and suffix dropper).

The file attack_utils.py contains additional code for evaluating models, the first-order taylor expansion, computing embedding gradients, and evaluating the top candidates for the attack. Overall, the code in this repository is a stripped down and cleaned up version of the code used in the paper. The code is designed to be easy to understand and quick to get started with.

Getting Started

First, you need to get a machine translation model. Fortunately, fairseq already has a number of pretrained models available. See this repository for a complete list. Here we will download a transformer-based English-German model that is trained on the WMT16 dataset.

wget https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2
wget https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2

bunzip2 wmt16.en-de.joined-dict.transformer.tar.bz2
bunzip2 wmt16.en-de.joined-dict.newstest2014.tar.bz2
tar -xvf wmt16.en-de.joined-dict.transformer.tar
tar -xvf wmt16.en-de.joined-dict.newstest2014.tar

Malicious Nonsense

Now we can run an interactive version of the malicious nonsense attack.

export CUDA_VISIBLE_DEVICES=0
python malicious_nonsense.py wmt16.en-de.joined-dict.newstest2014/ --arch transformer_vaswani_wmt_en_de_big --restore-file wmt16.en-de.joined-dict.transformer/model.pt  --bpe subword_nmt --bpe-codes wmt16.en-de.joined-dict.transformer/bpecodes --interactive-attacks --source-lang en --target-lang de

The arguments we passed in are: the dataset we downloaded, the model architecture type (we downloaded a Transformer Big architecture), the model checkpoint path, the path to the BPE dictionary, and a flag to enable interactive attacks, respectively. The --source-lang and --target-lang flags are usually ok to omit because fairseq can automatically infer the language pair. If you want to run the attack on the WMT16 test set rather than interactively, you can omit the --interactive-attacks flag and pass in --valid-subset test. If you do not have a GPU, omit the export CUDA_VISIBLE_DEVICES=0 command and also pass in the --cpu argument in the command.

Now you can enter a sentence that you want to turn into malicious nonsense. Let's try something benign like I am a student at the University down the hill. You can also try something more malicious like Barack Obama was shot by a rebel group or whatever your desired adversarial malicious input/output from the model is.

Targeted Flips

The other attacks follow the same arguments as malicious nonsense.

python targeted_flips.py wmt16.en-de.joined-dict.newstest2014/ --arch transformer_vaswani_wmt_en_de_big --restore-file wmt16.en-de.joined-dict.transformer/model.pt  --bpe subword_nmt --bpe-codes wmt16.en-de.joined-dict.transformer/bpecodes --interactive-attacks --source-lang en --target-lang de

For targeted flips we currently assume that --interactive-attacks is set.

First, enter the sentence that you want to attack, e.g., I am sad which translates to Ich bin traurig for the English-German model we downloaded above. Then, choose the word in the target side that you want to flip, e.g., traurig and what you want to flip it to, e.g., froh (which means happy/glad in English). Then, you can enter nothing for the optional lists. This should cause the attack to flip the input from I am sad to I am glad.

Of course, I am glad is not "adversarial" in the sense that the model is making a correct translation. We can restrict the attack from adding the word glad into the attack. The attack finds I am lee which the model translates as Ich bin froh.

Universal Attacks

python universal.py wmt16.en-de.joined-dict.newstest2014/ --arch transformer_vaswani_wmt_en_de_big --restore-file wmt16.en-de.joined-dict.transformer/model.pt  --bpe subword_nmt --bpe-codes wmt16.en-de.joined-dict.transformer/bpecodes --interactive-attacks --source-lang en --target-lang de

This commands defaults to the untargeted attack. Passing --suffix-dropper will perform the suffix dropper attack.

References

Please consider citing our work if you found this code or our paper beneficial to your research.

@article{Wallace2020Stealing,
    Author = {Eric Wallace and Mitchell Stern and Dawn Song},    
    journal={arXiv preprint arXiv:2004.15015},
    Year = {2020},
    Title = {Imitation Attacks and Defenses for Black-box Machine Translation Systems}
}

Contributions and Contact

This code was developed by Eric Wallace, contact available at ericwallace@berkeley.edu.

If you'd like to contribute code, feel free to open a pull request. If you find an issue with the code, please open an issue.