Implementation of Goldilocks attack
From [add link to paper here]
Overview
This is an implementation of the Goldilocks attack on code2seq. The attack is performed in several steps, and the automation script (automate.py, described in the ./python section) puts them together.
Installation
- To install this attack, you will first need to install Python 3.5+ and all of the dependencies listed in python/requirements.txt.
- You will need an installation of the Java Runtime Environment (version 8+). On Debian or Ubuntu, I recommend sudo apt install default-jre.
- This repository includes a modified implementation of code2seq (the original can be found at https://github.com/tech-srl/code2seq). Add our implementation of code2seq to your PYTHONPATH.
- The file python/attack_config.py includes a few parameters that configure the attack, including the dataset to perturb and the code2seq model to use. Review this file and set the appropriate configuration. To reproduce the java-large results of the paper, you do not have to change this file.
- If you wish to reproduce the java-large results of the paper, run the script bash/download-model-and-dataset.sh from the bash directory and follow the instructions.
- To get results, run automate.py --topk [topk] --dictionary {l2|frequency|random} --type {same|different|single} and look in the generated data/results folder. See the ./python section for further details.
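Before running the attack, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch of such a check and is not part of this repository: the function name check_prerequisites is hypothetical, and the PYTHONPATH test is only a heuristic that assumes the directory you added is literally named code2seq. It does not verify the Java version or the packages in python/requirements.txt.

```python
import os
import shutil
import sys


def check_prerequisites(version_info=None, which=shutil.which, pythonpath=None):
    """Return a list of problems found; an empty list means the basics look fine.

    Hypothetical helper, not shipped with this repository.
    """
    if version_info is None:
        version_info = sys.version_info
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")

    problems = []

    # The README asks for Python 3.5 or newer.
    if tuple(version_info[:2]) < (3, 5):
        problems.append("Python 3.5+ is required")

    # A JRE (version 8+) is required; this only checks that 'java' is on PATH.
    if which("java") is None:
        problems.append("no 'java' executable found on PATH")

    # Heuristic (assumption): the modified code2seq was added to PYTHONPATH
    # as a directory whose last path component is 'code2seq'.
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    names = [os.path.basename(os.path.normpath(p)) for p in entries]
    if "code2seq" not in names:
        problems.append("no 'code2seq' directory found on PYTHONPATH")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing prerequisite: {}".format(problem))
```

An empty result only means the basic tools were found; the configuration in python/attack_config.py still needs to point at a valid model and dataset.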
./code2seq
The implementation of code2seq with modifications to the preprocessing script preprocess2.sh, along with a precompiled version of JavaExtractor.
./bash
download-model-and-dataset.sh is a script that downloads and automatically installs the java-large model and the java-small dataset so that they are ready to be attacked.
./java
The implementation of RenameVariable, a tool that uses JavaParser to rename local variables for the attack.
./python
- automate.py --topk [topk] --dictionary {l2|frequency|random} --type {same|different|single} runs the automate script to generate a perturbation using the l2, frequency, or random dictionary (--dictionary) with a top-k parameter of topk (integer). The attacks use a concatenation strategy of either single, 5-diff, or 5-same (--type).
- {l2|frequency|all}_dictionary.py [topk] generates a vocabulary of the topk subtokens by the given metric.
- attack_config.py has a few parameters that specify the location of the code2seq model, the input dataset, and the output files.
- filter_has_local_variables.py [c2s file] reads the given c2s file and generates a file in c2s format that contains only the methods that have local variables.
- perturb_dataset.py [output directory] [vocabulary file] [subtoken concatenation strategy] calls the RenameVariable tool to perturb the dataset specified in attack_config.py with tokens from the vocabulary file, concatenated according to the given concatenation strategy.
- modified_model.py has an implementation of the code2seq model that exposes important internals.
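To make the vocabulary and concatenation steps above more concrete, here is a minimal sketch of how a frequency-based top-k vocabulary and a replacement variable name could be built. It is an illustration under stated assumptions, not the code in this directory: the function names are hypothetical, and the readings of the strategies (single picks one subtoken, same repeats one subtoken five times, different joins five distinct subtokens) are inferred from the names 5-same and 5-diff rather than confirmed by the scripts. Joining subtokens in camelCase is likewise an assumption.

```python
import random
import re
from collections import Counter


def split_subtokens(identifier):
    """Split 'maxItemCount' or 'max_item_count' into lowercase subtokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def frequency_vocabulary(identifiers, topk):
    """Return the topk most frequent subtokens across the given identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_subtokens(identifier))
    return [token for token, _ in counts.most_common(topk)]


def build_replacement(vocabulary, strategy, rng=None):
    """Build a replacement variable name from vocabulary subtokens.

    Assumed meaning of the strategies (not confirmed by the repository):
      single    -> one subtoken
      same      -> one subtoken repeated five times (5-same)
      different -> five distinct subtokens (5-diff); needs len(vocabulary) >= 5
    """
    rng = rng or random
    if strategy == "single":
        tokens = [rng.choice(vocabulary)]
    elif strategy == "same":
        tokens = [rng.choice(vocabulary)] * 5
    elif strategy == "different":
        tokens = rng.sample(vocabulary, 5)
    else:
        raise ValueError("unknown strategy: {}".format(strategy))
    # Assumption: subtokens are joined in camelCase so that the name splits
    # back into the same subtokens.
    return tokens[0] + "".join(token.capitalize() for token in tokens[1:])
```

In this sketch, build_replacement(vocabulary, "same") with the subtoken item would return itemItemItemItemItem. A seeded random.Random instance can be passed as rng to make the perturbation reproducible.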
./data
Miscellaneous data, including input and output files.