Bug in prediction code
malkin1729 opened this issue · 8 comments
I believe there is a bug around the following line in the baseline prediction code:
Specifically, a batch given as input to the LM is constructed by stacking several candidate token lists, which are first made the same length by padding with the token id 0. However, the scoring function that follows (L252-267) takes the likelihoods of these padding tokens into account when computing the total log-likelihood of each candidate.
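To make the issue concrete, here is a minimal sketch of the scoring pattern I mean (the function and variable names are my own illustration, not the repository's code; it assumes a Hugging Face GPT-2 model whose forward pass returns `.logits`):

```python
import torch
import torch.nn.functional as F

def score_candidates(model, candidate_token_lists, pad_id=0):
    # Pad all candidates to the same length with `pad_id` and stack them.
    max_len = max(len(c) for c in candidate_token_lists)
    batch = torch.tensor(
        [c + [pad_id] * (max_len - len(c)) for c in candidate_token_lists]
    )
    with torch.no_grad():
        logits = model(batch).logits  # [num_candidates, max_len, vocab]

    # Shift so that position t predicts token t+1.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    target = batch[:, 1:]
    token_ll = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)

    # Buggy: the sum includes log P(pad | ...) terms at every padded
    # position of the shorter candidates.
    buggy_scores = token_ll.sum(dim=-1)

    # Fixed: mask out the padded target positions before summing.
    mask = (target != pad_id).float()
    fixed_scores = (token_ll * mask).sum(dim=-1)
    return buggy_scores, fixed_scores
```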
A probe that shows this is indeed a bug: change the default padding token in that line from 0 to 50256 (GPT-2's EOS token id). The predictions and accuracies obtained by running `bash experiments/piqa/predict_baseline_zero_shot.sh 0 dev` will change.
If the code is modified to ignore the padding tokens when scoring, the baseline results improve significantly. For example, on the PIQA validation set, the accuracy of GPT2-XL, computed correctly, is 69.6% (compared with 62.6% as reported in Table 2 of your paper).
Thanks for letting me know. I will look at it early next week.
Both the model and the baseline scripts default to a padding id of zero when there is no padding token, so I'll re-run everything after changing the cross-entropy loss to ignore the padding tokens.
Thanks. I'm curious to see what the results will be like with the change.
This is really interesting work and a helpful codebase. Stumbled upon it while working on a project that is also about toying with contexts for better pretrained LM inference.
Thanks again for pointing out the bug. I changed the padding to `tokenizer.pad_token = tokenizer.eos_token` when `tokenizer.pad_token` is `None`, and I also changed the cross-entropy loss to ignore the padded tokens.
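In code, the change amounts to something like this (a simplified sketch rather than the exact diff; `candidate_log_likelihood` is just for illustration):

```python
import torch
from torch import nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Fall back to the EOS token when the tokenizer defines no pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
pad_id = tokenizer.pad_token_id  # 50256 for GPT-2

# Cross-entropy that skips the padded positions when scoring a candidate.
loss_fct = nn.CrossEntropyLoss(ignore_index=pad_id, reduction="sum")

def candidate_log_likelihood(input_ids):
    """Total log-likelihood of one padded candidate, ignoring pad targets."""
    with torch.no_grad():
        logits = model(input_ids).logits
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    nll = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                   shift_labels.view(-1))
    return -nll
```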
The (development) accuracies are indeed higher across the board, but the trend generally holds (baseline < knowledge model ~ self-talk).
Task | Type | LM | Knowledge Source | Accuracy
---|---|---|---|---
COPA | Baseline | GPT2-L | - | 68.0
COPA | Ext. Knowledge | GPT2-XL | COMET | 67.0
COPA | Self-Talk | GPT2-XL | DistilGPT2 | 69.0
CommonsenseQA | Baseline | GPT2-XL | - | 36.7
CommonsenseQA | Ext. Knowledge | GPT2-L | Google Ngrams | 42.6
CommonsenseQA | Self-Talk | GPT2-XL | XLNet-B | 41.7
MC-TACO | Baseline | GPT2-XL | - | 61.2
MC-TACO | Ext. Knowledge | GPT2-XL | COMET | 63.2
MC-TACO | Self-Talk | GPT2-XL | GPT | 63.0
Social IQa | Baseline | GPT2-L | - | 43.7
Social IQa | Ext. Knowledge | GPT2-XL | ConceptNet | 46.3
Social IQa | Self-Talk | GPT2-XL | XLNet-B | 45.9
PIQA | Baseline | GPT2-XL | - | 70.2
PIQA | Ext. Knowledge | GPT2-XL | ConceptNet | 70.9
PIQA | Self-Talk | GPT2-XL | GPT2 | 71.0
WinoGrande | Baseline | GPT2-XL | - | 55.6
WinoGrande | Ext. Knowledge | GPT2-XL | COMET | 55.2
WinoGrande | Self-Talk | GPT2-XL | DistilGPT2 | 55.2
Thanks for the quick fix and for sharing the new numbers!
Sharing the results of the aforementioned project (ACL'22, to appear).
https://arxiv.org/abs/2110.08294
It would be interesting to see if there is an additive effect of coherence boosting on self-talk, with contrastive scoring to increase the effect of the self-talk explanations.
Nice, thanks for sharing your paper! It might work well when using multiple clarifications (instead of one).
I just ran a quick experiment with CommonsenseQA, using the score `(1 - alpha) * (score with clarifications) + alpha * (score without clarifications)` with `alpha < 0` to boost the self-talk effect. There indeed seems to be a benefit, with peak accuracy improving by an additional 2% over self-talk alone (43.8%).
Because contrasting with the unconditional answer likelihood (as done in our paper) gives an improvement of over 10% on this dataset, we could think of selecting three log-linear mixture weights instead of two: one for the standard query+response, one for the unconditional response, and one for the response with clarifications. It should be very quick to test out (forthcoming!).
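For concreteness, the selection rule I have in mind would look something like this sketch (the names and weight arguments are invented; each term is a log-likelihood of the candidate answer under a different conditioning context, and the weights would be tuned on the dev set):

```python
import numpy as np

def mixed_score(logp_query, logp_uncond, logp_clarified,
                w_query, w_uncond, w_clarified):
    # Log-linear mixture of three per-candidate log-likelihoods:
    # the answer scored given the standard query, given no context
    # (unconditional), and given the query plus clarifications.
    # A contrastive/boosting setup would typically put a negative
    # weight on the unconditional term.
    return (w_query * logp_query
            + w_uncond * logp_uncond
            + w_clarified * logp_clarified)

def pick_answer(candidate_scores, weights):
    # candidate_scores: one (logp_query, logp_uncond, logp_clarified)
    # tuple per answer choice; weights: (w_query, w_uncond, w_clarified).
    mixed = [mixed_score(*s, *weights) for s in candidate_scores]
    return int(np.argmax(mixed))
```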
If you'd like to discuss further, perhaps we could take it to email?
Sure, I'd love to hear more; we can discuss it over email. It would probably take me a while to find time to read the paper in depth and respond :)