This project contains the source code to replicate the experiments from the paper "Probing Pretrained Models of Source Code", accepted to the BlackboxNLP workshop at EMNLP 2022.
The code structure is as follows:
- `scripts`: scripts for running experiments
- `src`
  - `models`: pretrained models
  - `struct_probing`
    - `code_augs`: synthetic changes to code
    - `probings`: utils for probings
- `CodeAnalysis`: directory with processed data and results
```bash
git clone https://github.com/serjtroshin/probings4code
cd probings4code
```

Go to the link, download the data, and unpack it:

```bash
unzip CodeAnalysis.zip
```
The scripts were run on CentOS Linux 7 with Python 3.9.2.
Create a conda environment for the project and install the requirements:

```bash
conda create -n probings4code python=3.9.12
conda activate probings4code
conda install pytorch==1.11.0 torchvision==0.12.0 cudatoolkit=11.6 numpy=1.22.3 -c pytorch -y
pip install -r requirements.txt
```
Install the tree-sitter parsers for Python and Java by running:

```bash
bash build.sh
```
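After the build, the parsers can be used from Python roughly as follows (a sketch: the shared-library path `build/my-languages.so` is an assumption about what `build.sh` produces, and the snippet relies on the older `Language(path, name)` API of `py-tree-sitter`):

```python
from tree_sitter import Language, Parser

# Load the compiled Python grammar (path and name are assumptions, see above).
PY_LANGUAGE = Language("build/my-languages.so", "python")

parser = Parser()
parser.set_language(PY_LANGUAGE)

# Parse a small snippet and print its syntax tree as an S-expression.
tree = parser.parse(b"def add(a, b):\n    return a + b\n")
print(tree.root_node.sexp())
```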
To prepare the data for the tasks and create train/test splits, run:

```bash
bash generate_synt_data.sh
```
The script will output an `all.json` file together with the train/test splits `train.json` and `test.json` in the following subfolders of the `CodeAnalysis` directory: `identity`, `undeclared`, `dfg`, `identname`, `varmisuse`, `readability`, `algo`.
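For a quick sanity check of the generated splits, something like the following can be used (a minimal sketch assuming each split is a JSON-lines file; verify the actual format against the output of `generate_synt_data.sh`):

```python
import json
from pathlib import Path

def load_split(task: str, split: str):
    """Load a generated split for a probing task (path layout from above)."""
    path = Path("CodeAnalysis") / task / f"{split}.json"
    with path.open() as f:
        # Assumption: one JSON object per line; if the file is a single
        # JSON array, use json.load(f) instead.
        return [json.loads(line) for line in f if line.strip()]

train = load_split("varmisuse", "train")
print(f"varmisuse train examples: {len(train)}")
```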
To prepare the data for the ablation study (Appendix), run:

```bash
bash prepare_ablation.sh
```
The `src/models` directory contains a folder for each pretrained model. To download the BERT, CodeBERT, GraphCodeBERT, CodeT5, and GPT checkpoints and tokenizers from HuggingFace, use:

```bash
bash download_models.sh
```
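As an illustration of what this download amounts to (a sketch, not the script itself; `microsoft/codebert-base` is the public HuggingFace id for CodeBERT, and the other checkpoints would be fetched analogously):

```python
from transformers import AutoModel, AutoTokenizer

# Fetch a checkpoint and tokenizer from the HuggingFace hub and cache them
# locally; download_models.sh presumably does this for every probed model.
name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
print(model.config.num_hidden_layers, "transformer layers")
```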
To run the experiments with PLBART models, please download the pretrained `plbart_base` and `plbart_large` checkpoints from the official PLBART repository and put them in the `src/models/plbart` folder. Finetuned checkpoints are also available in the official PLBART repository to reproduce the experiments comparing finetuned models (Figure 5). Use `src/models/available_models.py` to provide the relevant paths to the checkpoints, e.g. as sketched below.
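The exact contents of `available_models.py` are defined by the repository; purely as a hypothetical sketch of how checkpoint paths might be registered (all keys and paths below are placeholders):

```python
# src/models/available_models.py (hypothetical sketch; adapt to the real file)
from pathlib import Path

PLBART_DIR = Path("src/models/plbart")

AVAILABLE_MODELS = {
    "plbart_base": PLBART_DIR / "plbart_base.pt",
    "plbart_large": PLBART_DIR / "plbart_large.pt",
}
```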
To save embeddings from all layers for all tasks, use:

```bash
bash save_embeddings.sh
```

The script will save the embeddings to `data_all.pkz` files in the `CodeAnalysis` subfolders.

- NOTE: saving all embeddings requires up to 10 TB of disk space.
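If you need to inspect the cached embeddings directly, a starting point could be (a sketch under the assumption that `.pkz` is a gzip-compressed pickle; check the saving code for the actual format):

```python
import gzip
import pickle
from pathlib import Path

def load_embeddings(task: str):
    """Load cached embeddings for one task (file format assumed, see above)."""
    path = Path("CodeAnalysis") / task / "data_all.pkz"
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

data = load_embeddings("varmisuse")
print(type(data))
```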
The `run_parallel.py` script runs the probing experiments for all models on all tasks, saving the results in CSV format for each model/probing pair in the `CodeAnalysis` directory.
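To collect the per-pair results into one table for plotting, a minimal `pandas` sketch (the file layout is an assumption; match the glob pattern to the CSV files `run_parallel.py` actually writes):

```python
from pathlib import Path

import pandas as pd

# Gather every result CSV under CodeAnalysis into one dataframe,
# remembering which subfolder (probing task) each file came from.
frames = []
for path in Path("CodeAnalysis").rglob("*.csv"):
    df = pd.read_csv(path)
    df["task"] = path.parent.name
    frames.append(df)

results = pd.concat(frames, ignore_index=True)
print(results.head())
```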
To replicate the experiments with the linear probing model (Figures 3 and 4), run:

```bash
python3 run_parallel.py
```
To run the probing experiments with a 3-layer MLP:

```bash
python3 run_parallel.py --probing_model mlp
```
To run the experiments for the ablation study (Appendix), use:

```bash
bash run_ablation.sh
```
Note that you can pass the `--model <model_name>` and `--probing <probing_task_name>` flags to `run_parallel.py` to run a particular model on a particular task, as sketched below.
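For instance, a sweep over all tasks for a single model could be scripted as follows (a sketch: `"codebert"` is a hypothetical model identifier; the accepted values come from `src/models/available_models.py`):

```python
import subprocess

TASKS = ["identity", "undeclared", "dfg", "identname",
         "varmisuse", "readability", "algo"]

# Run one (hypothetical) model on every probing task sequentially.
for task in TASKS:
    subprocess.run(
        ["python3", "run_parallel.py", "--model", "codebert", "--probing", task],
        check=True,
    )
```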
We use the following projects in our work:
If you found this code useful, please cite our work:

```bibtex
@misc{https://doi.org/10.48550/arxiv.2202.08975,
  doi = {10.48550/ARXIV.2202.08975},
  url = {https://arxiv.org/abs/2202.08975},
  author = {Troshin, Sergey and Chirkova, Nadezhda},
  keywords = {Software Engineering (cs.SE), Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Probing Pretrained Models of Source Code},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```