Implementation of Goldilocks attack

From [add link to paper here]

Overview

This is an implementation of the Goldilocks attack on code2seq. The attack runs in several steps, which the automation script (python/automate.py) chains together.

Installation

Install Python 3.5+ and all of the dependencies listed in python/requirements.txt.
Install the Java Runtime Environment (version 8+). On Debian or Ubuntu, I recommend sudo apt install default-jre.
This repository includes a modified implementation of code2seq in the ./code2seq directory (the original can be found at https://github.com/tech-srl/code2seq). Add our implementation of code2seq to your PYTHONPATH.
The file python/attack_config.py includes a few parameters that configure the attack, including the dataset to perturb and the code2seq model to use. Review this file and adjust it for your setup. To reproduce the java-large results of the paper, you do not have to change this file.
To reproduce the java-large results of the paper, run the script download-model-and-dataset.sh from within the bash directory and follow its instructions.
To run the attack, run automate.py --topk [topk] --dictionary {l2|frequency|random} --type {same|different|single}. Results are written to the generated data/results folder. See the ./python section for further details.
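To run the attack over several configurations, the automate.py invocation above can be generated programmatically. The following is a minimal sketch, not part of the repository: the helper names build_command and sweep are hypothetical, and it only uses the three flags documented above. It prints the commands as a dry run and does not launch them.

```python
import itertools
import sys

# Flag values taken from the usage line in this README.
DICTIONARIES = ("l2", "frequency", "random")
TYPES = ("same", "different", "single")


def build_command(topk, dictionary, attack_type, script="automate.py"):
    """Return the argv list for one automate.py run (hypothetical helper)."""
    if dictionary not in DICTIONARIES:
        raise ValueError("unknown dictionary: %r" % (dictionary,))
    if attack_type not in TYPES:
        raise ValueError("unknown type: %r" % (attack_type,))
    if int(topk) < 1:
        raise ValueError("topk must be a positive integer")
    return [
        sys.executable, script,
        "--topk", str(topk),
        "--dictionary", dictionary,
        "--type", attack_type,
    ]


def sweep(topks):
    """Return one command per (topk, dictionary, type) combination."""
    combos = itertools.product(topks, DICTIONARIES, TYPES)
    return [build_command(k, d, t) for k, d, t in combos]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in sweep([1, 5]):
        print(" ".join(cmd))
```

Each printed command can be passed to subprocess.run(cmd, check=True) once the configuration in python/attack_config.py is in place. The working directory from which automate.py must be started is not stated here, so the sketch leaves it to the caller.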

`./code2seq`

The implementation of code2seq with modifications to the preprocessing script preprocess2.sh, along with a precompiled version of JavaExtractor.

`./bash`

download-model-and-dataset.sh is a script to download and automatically install the java-large model and the java-small dataset so that they are ready to be attacked.

`./java`

The implementation of RenameVariable, a tool that uses JavaParser to rename local variables for the attack.

`./python`

automate.py --topk [topk] --dictionary {l2|frequency|random} --type {same|different|single} runs the full attack: it generates a perturbation using the l2, frequency, or random dictionary (--dictionary) with a top-k parameter of topk (an integer). The --type flag selects the concatenation strategy: single (single), 5-diff (different), or 5-same (same).
{l2|frequency|all}_dictionary.py [topk] generates a vocabulary of the topk subtokens ranked by the given metric.
attack_config.py has a few parameters that specify the locations of the code2seq model, the input dataset, and the output files.
filter_has_local_variables.py [c2s file] reads the given c2s file and generates a file in c2s format that contains only the methods that have local variables.
perturb_dataset.py [output directory] [vocabulary file] [subtoken concatenation strategy] calls the RenameVariable tool to perturb the dataset specified in attack_config.py, renaming local variables to tokens from the vocabulary file that are concatenated according to the given strategy.
modified_model.py has an implementation of the code2seq model that exposes important internals.
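
The vocabulary and concatenation steps described above can be illustrated with a small sketch. It is not the repository's code, and it rests on assumptions that this README does not confirm: that the frequency dictionary ranks subtokens by raw count, that single uses one subtoken, that 5-same repeats one subtoken five times, that 5-diff joins five distinct subtokens, and that subtokens are joined in camelCase. The function names top_k_subtokens and build_name are hypothetical.

```python
from collections import Counter


def top_k_subtokens(methods, k):
    """Return the k most frequent subtokens (assumed frequency metric).

    `methods` is an iterable of subtoken lists, one list per method.
    """
    counts = Counter(tok for method in methods for tok in method)
    return [tok for tok, _ in counts.most_common(k)]


def build_name(vocabulary, strategy, index=0, length=5):
    """Build a replacement variable name from vocabulary subtokens.

    Assumed meaning of each strategy:
      single    -> one subtoken
      same      -> one subtoken repeated `length` times
      different -> `length` consecutive subtokens, wrapping around
                   if the vocabulary is shorter than `length`
    """
    if not vocabulary:
        raise ValueError("vocabulary must not be empty")
    if strategy == "single":
        parts = [vocabulary[index % len(vocabulary)]]
    elif strategy == "same":
        parts = [vocabulary[index % len(vocabulary)]] * length
    elif strategy == "different":
        parts = [vocabulary[(index + i) % len(vocabulary)]
                 for i in range(length)]
    else:
        raise ValueError("unknown strategy: %r" % (strategy,))
    # Join in camelCase (an assumption about the naming convention).
    return parts[0] + "".join(p.capitalize() for p in parts[1:])


if __name__ == "__main__":
    methods = [["get", "name"], ["get", "value", "name"], ["get"]]
    vocab = top_k_subtokens(methods, 3)
    for strategy in ("single", "same", "different"):
        print(strategy, build_name(vocab, strategy))
```

In the repository, the actual renaming is performed by the RenameVariable tool, which perturb_dataset.py calls. The sketch only shows how a replacement name might be assembled before that step.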

`./data`

Miscellaneous data, including input and output files.
