Retraining
fengzhangyin opened this issue · 6 comments
When I retrain the model, the program ends automatically after four epochs, and the results are far below those in the paper. I followed the instructions completely, so why can't I reproduce the results?
Hi Zhangyin,
If you are using the make train command as documented in the README, the first experiment it runs is the simplest token-wise Seq2Seq baseline, and its results are significantly worse than our best model's.
You may use this command to train the best model reported in the NL2Bash paper:
./bash-copy-partial-token.sh
Let me know if this helps.
@todpole3 Sorry, it doesn't help.
First, I used the make train command and noticed the problem. Then I ran ./bash-copy-partial-token.sh to train the seven models. However, the same problem persists: the program ends very quickly during training and the results are much lower than reported. I would like to know how many epochs each model needs to be trained for to achieve the best results.
You can also try re-executing make train or ./bash-copy-partial-token.sh to reproduce the problem.
These are the dev set results from my run of ./bash-copy-partial-token.sh:
Now I don't know what I should do. Thanks.
Hi Zhangyin,
I'm very sorry about the confusion caused.
First of all, the "Average top k BLEU Score" and "Average top k Template Match Score" you obtained are comparable to, and slightly higher than, the ones we reported in Appendix C of the paper, hence I believe you have already retrained the model and obtained the predicted commands correctly.
The question remains why you are getting lower "Top k Match (template-only)" and "Top k Match (whole-string)" scores compared to those reported in Table 8. The scores in Table 8 are accuracies resulting from manual evaluation. I have already uploaded all our manual evaluations here. The evaluation function needs to read those files in order to output the manual evaluation accuracy properly. If for some reason it failed to read them, the resulting accuracy would be significantly lower, since the model output contains many false negatives (correct Bash commands that are not included in our data collection).
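For intuition, here is a minimal sketch of the lookup behavior described above (the function names and file format are hypothetical, not the repository's actual code): any prediction that is neither among the collected ground truths nor in the stored judgements is counted as wrong, which is exactly what happens if the judgement files cannot be read.

# Hypothetical illustration only; the real file format and API may differ.
def load_judgements(path):
    judged_correct = set()
    try:
        with open(path) as f:
            for line in f:
                nl, cmd, verdict = line.rstrip("\n").split("\t")
                if verdict == "y":
                    judged_correct.add((nl, cmd))
    except FileNotFoundError:
        # If the judgement file is missing, nothing outside the collected
        # ground truths can be credited as correct.
        pass
    return judged_correct

def top1_accuracy(examples, judged_correct):
    # examples: list of (nl, set_of_ground_truth_commands, predicted_command)
    hits = sum(1 for nl, gts, pred in examples
               if pred in gts or (nl, pred) in judged_correct)
    return hits / len(examples)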
I will check if the pointer to the "manual_judgements" directory is correctly set up and get back to you.
Sorry about the delay. I pushed a few fixes which I think should address the issue to a great extent.
I think the issue also exposes a caveat of the manual evaluation methodology we resort to in this paper. I will discuss it in detail below.
A) I reproduced the dev set manual evaluation results of the "Tellina" model (Table 8, last row) by running the following commands:
./bash-token.sh --normalized --fill_argument_slots --train 0
./bash-token.sh --normalized --fill_argument_slots --gen_slot_filling_training_data 0
./bash-token.sh --normalized --fill_argument_slots --decode 0
./bash-token.sh --normalized --fill_argument_slots --manual_eval 0
B) I reproduced the dev set manual evaluation results of the "Sub-Token CopyNet" model (Table 8, second to last row) by running the following commands, plus the additional effort of inputting a few manual judgements myself.
./bash-copy-partial-token.sh --train 0
./bash-copy-partial-token.sh --decode 0
./bash-copy-partial-token.sh --manual_eval 0
There are two reasons why you got lower evaluation scores initially.
First, in our paper we only manually annotated 100 examples (randomly sampled) from the dev set, but simply running ./bash-copy-partial-token.sh --decode generates and returns evaluation results on the full dev set (549 examples). Hence the manual annotations are missing for 449 examples, and you will get much lower scores on them since their false-negative predictions are counted as wrong.
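As a rough sketch of why this matters (illustrative code, not the repository's actual API): restricting the accuracy computation to the annotated subset removes the 449 examples whose correct-but-unannotated predictions would otherwise be counted as wrong.

# Hypothetical sketch: score only the manually annotated examples
# (100 of the 549 dev examples in the paper).
def accuracy_on_annotated(examples, annotated_nl, is_correct):
    subset = [(nl, pred) for nl, pred in examples if nl in annotated_nl]
    hits = sum(1 for nl, pred in subset if is_correct(nl, pred))
    return hits / len(subset) if subset else 0.0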
Second, due to the randomness of the NN implementation, the models may output different Bash commands across different runs, and some newly generated commands may not be in the manual judgements we already collected. There are false negatives among those too, and they need to be re-judged. (I did not fix the random seed for TensorFlow correctly, hence I still observe differences in the predictions across runs, although the evaluation results do not change significantly.)
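For reference, a minimal seed-pinning sketch (assuming the TensorFlow 1.x API that the codebase appears to use; the actual entry point may differ) looks like the following, though some GPU ops remain non-deterministic even with all seeds fixed:

# Hypothetical sketch of fixing the random seeds before the graph is built.
import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)          # Python-level randomness (e.g. data shuffling)
np.random.seed(SEED)       # NumPy-level randomness
tf.set_random_seed(SEED)   # graph-level seed (TensorFlow 1.x API)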
The --manual_eval flag calls the manual evaluation script, which opens a command-line interface for you to input a judgement for any command prediction that was not previously judged. You can input the judgement based on your domain knowledge, and the script will proceed until all predicted outputs have been annotated (as shown below).
#92. List all files under current directory matching the regex '.*\(c\|h\|cpp\)'
- GT0: find . -type f -regex '.*\(c\|h\|cpp\)' -exec ls {} \;
> find . -regex '.*\(' -print0 | xargs -0 -I {} ls -l {}
CORRECT STRUCTURE? [y/reason] y
CORRECT COMMAND? [y/reason] n
new judgement added to ../data/bash/manual_judgements/manual.evaluations.additional
The script prints the manual evaluation metrics at the end.
100 examples evaluated
Top 1 Command Acc = 0.370
Top 3 Command Acc = 0.490
Top 1 Template Acc = 0.500
Top 3 Template Acc = 0.620
The caveat is that, to develop with our codebase, one needs to constantly redo the manual judgement step, as any model change will produce new predictions that were previously unseen. This makes dev evaluation tedious and subjective, since the researchers themselves may not be proficient enough in Bash to do the judgement, and different researchers may use different annotation standards.
This problem is even more serious for testing, as a researcher would need to rerun the 3-annotator manual evaluation experiment to generate numbers comparable to our paper. This approach does not generalize well.
Hence my current suggestions are 1) to use the automatic evaluation metrics proposed in Appendix C as coarse guidance for development (keep in mind that they do not strictly correlate with the manual evaluation metrics; I would also encourage you to think of additional automatic evaluation methods), and 2) if you report new manual evaluation results on the test set, have the annotators judge both our system output and your system output so that the numbers are comparable (different annotators may have different standards, which makes evaluation scores produced by different sets of annotators incomparable).
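If it helps, here is a small sketch of one such coarse automatic signal: a smoothed sentence-level BLEU between a predicted command and the reference commands, using NLTK. The whitespace tokenization here is only illustrative and is not the exact metric implementation used in the paper.

# Illustrative only; whitespace tokenization is a simplification.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def command_bleu(prediction, references):
    smooth = SmoothingFunction().method3
    ref_tokens = [r.split() for r in references]
    return sentence_bleu(ref_tokens, prediction.split(), smoothing_function=smooth)

print(command_bleu("find . -type f -name '*.c'",
                   ["find . -type f -name '*.c'", "find . -name '*.c' -type f"]))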
Meanwhile, I will think about better automatic evaluation methodology and how to build a common platform for test evaluation. Suggestions are welcome. Thank you for drawing this to our attention!