mast-group/OpenVocabCodeNLM

Reproducing Bug Entropies

cedricrupb opened this issue · 1 comment

Question: How can the bug entropy drop reported in Table 2 of Big Code != Big Vocabulary be reproduced?

In particular, we are currently testing:
BPE NLM (512) trained on the large code base

From the paper it is not fully clear how the bug patches were preprocessed, and I hope that you can help me there or provide
us with a replication script.

The following steps were performed, and our results differ by up to 1%:

Benchmark: Defects4J 1.0
For every hunk in a bug patch, a defective version and a patched version are produced.
The defective version includes the defective lines (marked with +) and all surrounding context lines (those without +, -, or @).
The patched version is produced in a similar way (see the sketch below).
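To make this concrete, here is a minimal sketch of how we currently build the two versions from one hunk. It is not taken from any existing script; it assumes, as described above, that + marks the defective lines, and the function name is ours:

```python
def snippets_from_hunk(hunk_body):
    """hunk_body: the lines of one hunk, without the '@@ ... @@' header."""
    defective, patched = [], []
    for line in hunk_body:
        if line.startswith('+'):
            defective.append(line[1:])   # line present only in the defective version
        elif line.startswith('-'):
            patched.append(line[1:])     # line present only in the patched version
        elif not line.startswith('@'):
            # context line: kept in both versions (leading space stripped)
            defective.append(line[1:])
            patched.append(line[1:])
    return '\n'.join(defective), '\n'.join(patched)
```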

Patch tokenization:
The created code was preprocessed with CodePrep using the following options:

  • nosplit
  • no_spaces
  • no_unicode
  • full_string
  • no_com
  • max_str_length = 15

We found that CodePrep splits composed operators in Java (like &=, &&, ++, +=).
Therefore, we manually joined them again (as seems to have been done in the preprocessed training data from the artifact); a sketch of this step follows.
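The re-joining looks roughly like this. It is an illustrative sketch: the operator list is not exhaustive and the helper is our own, not part of CodePrep:

```python
# Composed Java operators that CodePrep emits as separate tokens (illustrative list).
COMPOSED_OPS = {'&&', '||', '++', '--', '+=', '-=', '*=', '/=', '%=', '&=', '|=', '^=',
                '==', '!=', '<=', '>=', '<<', '>>', '>>>', '<<=', '>>=', '>>>=', '->'}

def rejoin_operators(tokens):
    out = []
    for tok in tokens:
        # greedily merge with the previous token if the result is a composed operator
        if out and out[-1] + tok in COMPOSED_OPS:
            out[-1] += tok
        else:
            out.append(tok)
    return out
```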

Afterwards we split the tokens with Subword-NMT, using the BPE vocabulary with 10,000 merges supplied by the artifact.
A start token <s> and an end token </s> were added.
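In code, this step looks roughly as follows (assuming Subword-NMT's Python API; the codes file name is a placeholder for the artifact's merge file):

```python
from subword_nmt.apply_bpe import BPE

with open('bpe_10000.codes') as codes_file:   # placeholder path to the artifact's BPE merges
    bpe = BPE(codes_file)

def to_model_input(token_line):
    """token_line: whitespace-separated tokens after CodePrep and operator re-joining."""
    subwords = bpe.process_line(token_line).split()
    return ['<s>'] + subwords + ['</s>']
```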

Entropy drop:
The code_nlm.py script is used with the pre-trained model (BPE 10K, NLM 512, Large Train).
We use the predict option and export the entropies for each file.
The entropy drop is calculated as the difference between the entropy of the buggy version and that of the patched version, normalized by the entropy of the buggy version.
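In other words (names are ours, not output fields of code_nlm.py):

```python
def normalized_entropy_drop(buggy_entropy, patched_entropy):
    # positive value = the patched version is less surprising to the model
    return (buggy_entropy - patched_entropy) / buggy_entropy
```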

Hi,
I am happy to send you the exact files we used to get the results.
It might take until tomorrow to get access to where they are stored, though.

For tokenization we've used the tokenizer from Hellendoorn and Devanbu, but that should not make much of a difference.
If I understand correctly, you are computing the entropy over the whole files before and after the change.
That is problematic and will mask the entropy difference. If you look at the whole file there are hundreds or thousands of tokens and you are changing only a few of them, so if you take an average over the file you will not see any noticeable difference.
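A toy back-of-the-envelope example (made-up numbers) of why the signal gets diluted:

```python
file_tokens, changed_tokens = 2000, 5
per_token_drop = 2.0                                   # entropy drop (bits) on the changed tokens
file_level_drop = changed_tokens * per_token_drop / file_tokens
print(file_level_drop)                                 # 0.005 bits: essentially invisible in a file average
```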

The way this was calculated was to include a couple of surrounding lines from where the change happens.
If I recall correctly, it looks at the patch and builds a small code snippet out of the unchanged lines plus the before or after lines, respectively. That snippet is the one you should calculate the entropy on.
This is similar (but not identical) to what happens in the naturalness of buggy code paper; there they look at each line on its own.
If you are lucky I might still have the code that creates this. If so I'll upload it here.

I suspect that you could get slightly better results than what we report in the paper by first passing through the LSTM whatever appears before the snippet in the file (but not using the entropies of those tokens). We did not do that, though, because it would be unfair to the other models that do not have any such context.