TellinaTool/nl2bash

Understanding Bash Results

DNGros opened this issue · 5 comments

Hello,

First let me say that I think what you are doing is awesome! nl2bash is a really hard task, and you have made some great progress on it.

While trying to replicate some of the results from the paper, I am having some difficulty. When running ./bash-token.sh, I get the following results at the end:

100 examples evaluated
Top 1 Match (template-only) = 0.290
Top 1 Match (whole-string) = 0.050
Average top 1 Template Match Score = 0.690
Top 3 Match (template-only) = 0.350
Top 3 Match (whole-string) = 0.060
Average top 3 Template Match Score = 0.765

From my understanding, this is saying that 29% of the top-1 output templates from the network exactly match the ground truth. I am not completely sure what the 0.69 Template Match Score number is, but it seems to be the percentage of 1-gram tokens that overlap between the ground truth and the prediction (correct me if I'm wrong). Is this the value reported in Table 2 of the "Program Synthesis from Natural Language Using Recurrent Neural Networks" paper? When trying to replicate the result with our own seq2seq code, we are seeing results in the range of 20-30% template match, which seems consistent with the "Top 1 Match (template-only)" output. We are just trying to confirm whether this is correct, or whether it should actually be around 70%.
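To make my question concrete, here is a rough sketch of what I imagine the Template Match Score computes, i.e. a bag-of-tokens (1-gram) overlap between the predicted template and the ground-truth template. This is only my own guess at the metric, not code from this repo, so please correct me if the actual computation differs:

```python
def template_match_score(prediction, ground_truth):
    """Fraction of 1-gram tokens shared by two command templates.

    Purely illustrative: my guess at the metric, not the repo's implementation.
    """
    pred_tokens = set(prediction.split())
    gt_tokens = set(ground_truth.split())
    if not pred_tokens or not gt_tokens:
        return 0.0
    overlap = len(pred_tokens & gt_tokens)
    # Normalize by the larger token set so that both extra and missing
    # tokens lower the score.
    return overlap / max(len(pred_tokens), len(gt_tokens))

# Example:
# template_match_score("find [PATH] -name [REGEX]", "find [PATH] -iname [REGEX]")
# -> 0.75
```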

I would appreciate it if you could clarify what the expected outputs are.

Thanks!

First of all,

./bash-token.sh trains the plain seq2seq model.
./bash-token.sh --normalized should be used to reproduce the Tellina experiments.

Second, I apologize that I haven't uploaded the complete manual evaluation data here yet: https://github.com/TellinaTool/awesome_nmt/tree/master/data/bash/manual_judgements

You are therefore currently seeing a result lower than it should be, since many of the ground truths are missing. I will fix this by uploading the manual evaluation data ASAP.

Thank you for getting back to me!

Unfortunately, it won't be until next week that I can get access to a machine I can train on again, but I will look more into the results from the --normalized flag. It also looks like you added some code in 96c0760 that I was not up to date with at the time, so I need to pull. Thank you for letting me know that the flag is important.

Uploading any other data you have would be helpful if you get the chance. I am also planning some work to collect more data, which I intend to share.

What is the format of the data in the manual_judgements directory?

They are in .csv format. The function which loads them for evaluation can be found here:
https://github.com/TellinaTool/nl2bash/blob/master/eval/eval_tools.py#L730
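The gist of loading one of these files is something like the sketch below. The column names shown (description, prediction, correct_template, correct_command) are illustrative placeholders only; the actual schema and parsing logic are in the function linked above:

```python
import csv

def load_manual_judgements(path):
    """Illustrative loader for a manual judgement .csv file.

    The column names used here are placeholders; see eval/eval_tools.py
    for the actual schema used by the evaluation scripts.
    """
    judgements = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            # Key each judgement by the (NL description, predicted command) pair.
            key = (row['description'], row['prediction'])
            judgements[key] = {
                'correct_template': row['correct_template'].lower() == 'y',
                'correct_command': row['correct_command'].lower() == 'y',
            }
    return judgements
```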

I'm hoping to upload the rest of the manual judgment data sometime next week before Xmas break :)

It's great to hear that you're on track to collect more data and share it. Looking forward to it!

Hi @DNGros, I've uploaded the manual judgements provided by the programmers and cleaned up the scripts for reproducing the experiments.

Please let me know if you run into any further problems.

@todpole3 Thank you! That's great.

I'll go ahead and close this issue as I think my initial confusion has been resolved.