nitishgupta/nmn-drop

Training NMN model on full DROP (with numerical answers)

amritasaha1812 opened this issue · 11 comments

I have some queries regarding retraining your model on the subset of DROP that has only numerical answers. This subset has ~45K training instances, but the preprocessing scripts are able to generate additional supervision for only ~12K of them, and the final model appears to use only these for training (the remaining instances are not used).

Specifically, I am interested in training a version without 'execution_supervised' or 'qattn_supervised'. In all of these cases (removing one of these supervisions or both) the validation results are quite poor (around 13-15% Exact Match). Trying different learning rates and beam sizes also did not help.

Also, with all kinds of supervision turned on in the config file, on this subset of data I am getting poor validation performance of ~20% Exact Match. Does this sound reasonable, or am I doing something wrong? It would be great if you could shed some light on this.

I am using the tokenize.py code, followed by the preprocessing scripts in datasets/drop/preprocess, to generate the different types of supervision in the training data, and merge_data.py to get the final training data.
At the end, I get the following supervision counts from the training data: {'program_supervised': 9669, 'qattn_supervised': 7260, 'execution_supervised': 1750}.
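For reference, this is roughly how I compute those counts (a minimal sketch; it assumes the merged file follows the preprocessed DROP layout, a dict of passages each holding a qa_pairs list with boolean supervision flags, and the file name is a placeholder):

```python
import json
from collections import Counter

# Count how many questions carry each kind of supervision in the merged
# training file. Assumed layout: {passage_id: {..., "qa_pairs": [{...,
# "program_supervised": bool, ...}, ...]}}.
SUPERVISION_KEYS = ["program_supervised", "qattn_supervised", "execution_supervised"]

with open("drop_dataset_train.json") as f:  # placeholder path
    data = json.load(f)

counts = Counter()
for passage_info in data.values():
    for qa in passage_info.get("qa_pairs", []):
        for key in SUPERVISION_KEYS:
            if qa.get(key):
                counts[key] += 1

print(dict(counts))
```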

For your reference, I am attaching the scripts I used for generating this supervision for the training data. Can you kindly let me know if I am doing it the right way?

generate_annotations.txt

Our current model is only capable of handling a limited set of numerical operations, such as comparisons, min/max, and counting. Since DROP requires much more diverse reasoning, it is not surprising to see such low performance on all number-answer DROP questions.

The program and aux-module-output supervision is generated heuristically for only a few question types (the ones described in the ICLR paper), hence you only see supervision for a subset of the instances.

"the final model is being able to only use these for training" -- I suspect this is happening due to curriculum learning; the default SUPEPOCHS flag in train.sh is set to 5 which makes the model train only on questions with supervision for the first 5 epochs. After that it should be training on all questions. Please let me know if that is not the case and I will dig deeper.

Your scripts for generating the data seem fine; the issue is the limitation of the current model in its reasoning capability.
You can also try to visualize the trained model's output using scripts/iclr/predict.sh to see what kind of errors it makes.

Hope this helps. Please let me know if you have additional questions.

Also see our ACL paper on why performance on the ICLR subset is likely to be worse when you train the model on all of these questions. (As ACL is ongoing right now, here's a link to the conference page for the paper.)

Thanks a lot for clarifying my queries, and thanks for the pointer to the ACL work! This was really helpful.
I have one more query, though. In the conclusion of the ICLR paper, the number reported on the full DROP validation set is 65.4 F1, and with the pretrained model you made available I am getting comparable numbers.
But when I use the preprocessing scripts and train the model on the "numerical answer" subset of DROP, the numbers on the validation set are much poorer (~20 F1). I am not sure why there is this performance gap and whether I am doing it the right way.
Thanks!

That indeed is strange. Did you try evaluating the full model on the numerical answer subset to see how that performs?
Also, did you try to train the model on the span answer subset (anything not in the numerical answer subset)?

Could you share your data filtering script? I can try to train it myself and see what the issue might be.

Yes, by evaluating the pretrained model on the numerical answer subset I got ~65 F1. My data filtering script simply filters on the gold answer, i.e., I keep questions in the raw DROP json file where answer["number"] is a non-empty string (a rough sketch is included below). It would be great if you could kindly check whether you get a similar result.
I will also train a model on the non-numerical answer subset and get back to you.
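For concreteness, the filtering is essentially this (a minimal sketch; the file paths are placeholders, and it assumes the standard raw DROP json layout of passages with a qa_pairs list where each answer has a "number" field):

```python
import json

# Keep only questions whose gold answer is numeric, i.e. answer["number"]
# is a non-empty string in the raw DROP json. File paths are placeholders.
with open("drop_dataset_train.json") as f:
    data = json.load(f)

filtered = {}
for passage_id, passage_info in data.items():
    numeric_qas = [qa for qa in passage_info["qa_pairs"]
                   if qa["answer"].get("number", "") != ""]
    if numeric_qas:
        filtered[passage_id] = {**passage_info, "qa_pairs": numeric_qas}

with open("drop_dataset_train_numeric.json", "w") as f:
    json.dump(filtered, f)
```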

Sure, I'll try and let you know.

Hi, I was able to get 58% on the DROP numerical answer dev set when all types of supervision are in place. I think the issue was that even though "filter_for_epochs" was set to 5, it was still using only the filtered instances after epoch 5. Thanks for all your help!

That sounds great. That is really strange; two questions: (a) how did you fix it? (b) I am actually getting a dev performance of 48 F1; I have ~47.3k training / ~5.8k dev instances. Does that seem right?

(b) Yes, the training and dev data sizes are right. Is this 48% with all the types of supervision in place? And after running how many epochs?
(a) For some reason the 'epoch num' variable in the filter_iterator was staying at 0. I wanted to avoid making changes in the code, so I just set "filter_for_epochs" to -1 after it had trained for 5 epochs.

(b) I think yes; I did a quick fix by only taking the HowManyYards, YearDiff, and Count questions with supervision. Maybe I made a mistake there somehow. This was after ~17 epochs.

(a) That is strange; I don't see a reason for that happening. But glad you figured it out.

Thanks Nitish for all your help! I have another observation that is a little surprising. I trained the model on the numerical-answer subset of your ICLR version of pruned DROP (I am using the preprocessed train files provided by you and simply filtering based on the gold answer type), and I see the following ablation results when I remove one, some, or all of the 3 types of supervision (execution supervision, query attention supervision, and program supervision) by changing the options provided by you.
As you can see, performance is much worse when query attention supervision is enabled. Is there any way to explain this?
Screenshot 2020-07-16 at 1.55.22 am (attached screenshot of the ablation results)