This repository contains the code to create the parallel corpora for the Code Style Benchmark (CSB) and the code for the experiments described in the paper, such as fine-tuning and few-shot prompting.
The directory `src/python` contains the code for the Python transforms. `src/python/requirements.txt` provides the dependencies required to execute the code. The dependencies can be installed in a Python virtual environment with `pip install -r requirements.txt`.
- `list_comp_transform.py`: This script takes as a command line argument the path to a CSV file with a column `orig` that points to the paths of the files to be transformed (see the input-preparation sketch after this list). It can be run as `python list_comp_transform.py <path_to_the_input_csv_file>`. The transformed files are written to the same paths with the suffix `_transformed_uncomp.py`.
- `decorator_transform.py`: This script takes as a command line argument the path to a CSV file with a column `content` that holds the original code snippets to be transformed. It can be run as `python decorator_transform.py <path_to_the_input_csv_file>`. The transformed snippets, with decorators removed, are written to a new column `decorator_modified` in an output CSV.
- `casing_transform.py`: This script takes as a command line argument the path to a code file to be transformed. It can be run as `python casing_transform.py <path_to_a_Python_code_file>`. The case-transformed output is printed to stdout.
- `docstring_transform.py`: This script takes as a command line argument the path to a CSV file with a column `orig` that points to the paths of the files to be transformed. It can be run as `python docstring_transform.py <path_to_the_input_csv_file>`. The transformed files are written to the same paths with the suffix `_docstring_transform.py`.
- `comment_transform.py`: This script takes as a command line argument the path to a CSV file with a column `orig` that points to the paths of the files to be transformed. It can be run as `python comment_transform.py <path_to_the_input_csv_file>`. The transformed files are written to the same paths with the suffix `_comment_transform.py`.
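The sketch below shows one hypothetical way to prepare the input CSVs for these scripts with pandas; the file names, paths, and snippet are placeholders, not files shipped with the repository.

```python
import pandas as pd

# CSV for the path-based transforms (list_comp, docstring, comment):
# a single column named "orig" whose rows are paths to the Python files
# that should be transformed. The paths below are placeholders.
pd.DataFrame({"orig": ["data/example_module.py", "data/another_module.py"]}).to_csv(
    "input_paths.csv", index=False
)

# CSV for decorator_transform.py: a single column named "content" holding the
# code snippets themselves. The transformed snippets come back in a new
# "decorator_modified" column of the output CSV.
snippet = "@lru_cache(maxsize=None)\ndef fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
pd.DataFrame({"content": [snippet]}).to_csv("input_content.csv", index=False)
```

The scripts can then be run as described above, e.g. `python list_comp_transform.py input_paths.csv`.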
- Import ExtractProjectRunner into a fresh Eclipse workspace. Eclipse for Committers is the Eclipse version that should work.
- Import the Java code into a separate workspace.
- Run ExtractProjectRunner on the separate workspace, as either ExtractApplication or RenameApplication; Eclipse should prompt for a choice between these.
This script tokenizes the full dataset directly:
- `train`: the training set, named `bq_data_outlier.csv`
- `eval`: the evaluation set, named `evaluation_set.csv`
python tokenize_raw_script.py [train|eval]
To keep the dataset within the maximum sequence length, use this script to split off the short-sequence data and to extract function/class-level code from the long sequences. The output is two files: a short dataset and a long dataset.
python split_long_data.py [train|eval]
This script tokenizes the data and filters out all examples with NULL values.
python parallel_preprocessing_script.py FEATURES CSV_NAME OUTPUT_PATH
- FEATURES: the target style feature (e.g. `class`, as in the examples below)
- CSV_NAME: CSV file that contains all the individual features
- OUTPUT_PATH: the output dataset path of your choosing; it will be written as a `.hf` file
For the docstring transfer, preprocessing additionally removes very long sequences.
# individual
## i.e. class
### train
python parallel_preprocessing_script.py \
class \
bq_data_outlier_no_class.csv \
train_class_dataset.hf
### eval
python parallel_preprocessing_script.py \
class \
eval_set_individual_feat.csv \
eval_class_dataset.hf
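The resulting `.hf` outputs can be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming they were saved with `save_to_disk` (the dataset name matches the train example above):

```python
from datasets import load_from_disk

# Load the preprocessed dataset written by parallel_preprocessing_script.py.
ds = load_from_disk("train_class_dataset.hf")
print(ds)  # inspect the splits/columns before fine-tuning
```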
Seq2Seq generation fine-tuning with CodeT5.
# individual
export CUDA_VISIBLE_DEVICES=$(python gpu.py | tail -n 1); python seq2seq_train.py
You will need to configure the training in the script seq2seq_train.py (a sketch of these settings follows the list):
- `fname_prefix`: your repo directory, e.g. `/home/you/code-style-probing/`
- `train_dataset_hf_name`: the training set, e.g. `train_class_dataset.hf`. In the script we downsized it due to the training time constraint.
- `test_dataset_hf_name`: the test set, e.g. `test_class_dataset.hf`
- `output_dir_name`: the checkpoint folder, e.g. `codet5-class-checkpoints/`
- `model_checkpoint`: the checkpoint name; can be a local folder or a Hugging Face checkpoint, e.g. `Salesforce/codet5-small`
- `inference_only`: whether to only run inference on the test set, e.g. `False`
- `down_size_test_set`: whether to downsize the test set to save time, e.g. `True`
- `is_baseline`: if set, CodeT5 is trained from scratch, e.g. `False`
- `batch_size`: e.g. `16`
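As a rough illustration (not the exact contents of the script), the configuration could look something like this at the top of `seq2seq_train.py`:

```python
# Illustrative values only; edit these to match your setup.
fname_prefix = "/home/you/code-style-probing/"    # your repo directory
train_dataset_hf_name = "train_class_dataset.hf"  # training set (downsized in the script)
test_dataset_hf_name = "test_class_dataset.hf"    # test set
output_dir_name = "codet5-class-checkpoints/"     # checkpoint folder
model_checkpoint = "Salesforce/codet5-small"      # local folder or Hugging Face checkpoint
inference_only = False                            # only run inference on the test set
down_size_test_set = True                         # downsize the test set to save time
is_baseline = False                               # if True, train CodeT5 from scratch
batch_size = 16
```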
Usage: seq2seq_inference.py [OPTIONS] INFERENCE_DATASET MODEL_CKPT_PATH
OUTPUT_CSV_FILENAME
Arguments:
INFERENCE_DATASET [required]
MODEL_CKPT_PATH [required]
OUTPUT_CSV_FILENAME [required]
Options:
--batch-size INTEGER [default: 8]
--is-nl / --no-is-nl [default: no-is-nl]
--is-downsize / --no-is-downsize
[default: no-is-downsize]
Example:
rm -rf codestylist ; \
export CUDA_VISIBLE_DEVICES=1; \
python seq2seq_inference.py \
/data/code/curated_eval_set/curated_docstring_dataset_with_prompt.hf \
codestylist/combined_code_style_transformer \
combined_model_results/docstring.non_downsized.output.csv \
--batch-size 64 \
--is-nl ;
- DATASET_PATH: the path of the test set (`.hf`)
- CHECKPOINT: the model checkpoint path
- OUTPUT_FILE_PATH: the path of the prediction output
- IS_NL: [true|false], whether to use the control tokens
- IS_DOWNSIZE: [true|false], whether to downsize the test set; if enabled, it is downsized to 2000 examples
The output will be a prediction file that contains input/prediction/label.
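A quick way to inspect the prediction file, assuming the columns are literally named after the input/prediction/label fields (check the actual header of your output CSV; the path below matches the example above):

```python
import pandas as pd

# Replace with your OUTPUT_FILE_PATH.
preds = pd.read_csv("combined_model_results/docstring.non_downsized.output.csv")
print(preds.columns.tolist())  # confirm the column names
print(preds.head())
```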
The `codestylist` folder is removed first because the trainer automatically creates a folder with that name; if we then try to load the model from the Hub, loading fails because it tries to load from the empty folder created by the trainer instead. So the folder needs to be removed first, whether or not it exists.
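If you prefer to do this cleanup from Python rather than the shell, an equivalent, idempotent sketch (folder name as in the example above):

```python
import shutil

# Remove the local "codestylist" folder if it exists, so the model is loaded
# from the Hub rather than from an empty folder left behind by the trainer.
shutil.rmtree("codestylist", ignore_errors=True)
```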
Please see `seq2seq_eval.ipynb` (individual) for evaluation.
We now have a script, `evaluate_score.py`, for running the evaluation:
Usage: evaluate_score.py [OPTIONS] PRED_DIR OUTPUT_DIR TARGET_FEAT
Arguments:
PRED_DIR [required]
OUTPUT_DIR [required]
TARGET_FEAT [required]
Options:
--is-nl-tokens-added / --no-is-nl-tokens-added
[default: no-is-nl-tokens-added]
--clean-diff / --no-clean-diff [default: clean-diff]
Example:
python evaluate_score.py \
/data/ken/data/code/decorator.output_post_process.csv \
./test.json decorator \
--clean-diff
- PRED_DIR: your prediction CSV file
- OUTPUT_DIR: your score output JSON file name
- is-nl-tokens-added: N/A
- clean-diff: cleans some inconsistent characters caused by AST parsing and unparsing before calculating DiffBLEU
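The score output is a JSON file; a minimal sketch for reading it back (the exact keys depend on the script, so this just pretty-prints whatever is there; `test.json` is the OUTPUT_DIR from the example above):

```python
import json

with open("test.json") as f:
    scores = json.load(f)
print(json.dumps(scores, indent=2))
```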
The scripts `prompting_for_test_set.py`, `prompting_for_codenet.py`, and `prompting_for_codenet_java.py` all do the 1-shot prompting mentioned in the paper.
The main difference is the input data format.
Note that we use a proprietary API which hosts multiple models for inference, so two environment variables, `API_KEY` and `API_ENDPOINT`, are needed to connect to that service.
In theory, any model inferencing API can be used instead of this service.
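The two variables are ordinary environment variables; a hedged sketch of reading them from Python (the actual scripts may differ, and the request format of the proprietary service is not documented here):

```python
import os

# Both variables must be set before running the prompting scripts.
api_key = os.environ["API_KEY"]
api_endpoint = os.environ["API_ENDPOINT"]
```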
For example, `prompting_for_test_set.py` takes two command line parameters: the path to the evaluation dataset CSV from `eval_parallel_corpora_neurips.zip` (this file can be found at the CSB dataset location) and a task name.
The task name can take one of these values: ['list_comp', 'decorators', 'casing_java', 'casing_python', 'docstrings', 'comments', 'method_extraction' (for code encapsulation)]
It can be run as follows:
python prompting_for_test_set.py <path_to_evaluation_dataset_csv> <task_name>
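For convenience, a hypothetical loop over all the task names listed above (the CSV path is a placeholder for the file extracted from `eval_parallel_corpora_neurips.zip`):

```python
import subprocess

tasks = ["list_comp", "decorators", "casing_java", "casing_python",
         "docstrings", "comments", "method_extraction"]
for task in tasks:
    # Equivalent to: python prompting_for_test_set.py <path_to_evaluation_dataset_csv> <task_name>
    subprocess.run(
        ["python", "prompting_for_test_set.py", "evaluation_dataset.csv", task],
        check=True,
    )
```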