Bug in prediction code
malkin1729 opened this issue · 8 comments
I believe there is a bug around the following line in the baseline prediction code:
Specifically, a batch given as input to the LM is constructed by stacking several candidate token lists, which are first made the same length by padding with the token id 0. However, the scoring function that follows (L252-267) takes the likelihoods of these padding tokens into account when computing the total log-likelihood of each candidate.
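To make the issue concrete, here is a minimal sketch of the scoring pattern I mean (the function and variable names are my own illustration, not the repository's code; it assumes a Hugging Face GPT-2 model whose forward pass returns `.logits`):

```python
import torch
import torch.nn.functional as F

def score_candidates(model, candidate_token_lists, pad_id=0):
    # Pad all candidates to the same length with `pad_id` and stack them.
    max_len = max(len(c) for c in candidate_token_lists)
    batch = torch.tensor(
        [c + [pad_id] * (max_len - len(c)) for c in candidate_token_lists]
    )
    with torch.no_grad():
        logits = model(batch).logits  # [num_candidates, max_len, vocab]

    # Shift so that position t predicts token t+1.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    target = batch[:, 1:]
    token_ll = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)

    # Buggy: the sum includes log P(pad | ...) terms at every padded
    # position of the shorter candidates.
    buggy_scores = token_ll.sum(dim=-1)

    # Fixed: mask out the padded target positions before summing.
    mask = (target != pad_id).float()
    fixed_scores = (token_ll * mask).sum(dim=-1)
    return buggy_scores, fixed_scores
```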
A probe that shows this is indeed a bug: change the default padding token in that line from 0 to 50256 (GPT-2's EOS token id). The predictions and accuracies obtained by running `bash experiments/piqa/predict_baseline_zero_shot.sh 0 dev` will change.
If the code is modified to ignore the padding tokens when scoring, the baseline results improve significantly. For example, on the PIQA validation set, the accuracy of GPT2-XL, computed correctly, is 69.6% (compared with 62.6% as reported in Table 2 of your paper).
Thanks for letting me know. I will look at it early next week.
Both the model and the baseline scripts default to a padding id of zero when there is no padding token, so I'll re-run everything after changing the cross-entropy loss to ignore the padding tokens.
Thanks. I'm curious to see what the results will be like with the change.
This is really interesting work and a helpful codebase. Stumbled upon it while working on a project that is also about toying with contexts for better pretrained LM inference.
Thanks again for pointing out the bug. I changed the padding to `tokenizer.pad_token = tokenizer.eos_token` when `tokenizer.pad_token` is `None`, and I also changed the cross-entropy loss to ignore the padded tokens.
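In code, the change amounts to something like this (a simplified sketch rather than the exact diff; `candidate_log_likelihood` is just for illustration):

```python
import torch
from torch import nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Fall back to the EOS token when the tokenizer defines no pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
pad_id = tokenizer.pad_token_id  # 50256 for GPT-2

# Cross-entropy that skips the padded positions when scoring a candidate.
loss_fct = nn.CrossEntropyLoss(ignore_index=pad_id, reduction="sum")

def candidate_log_likelihood(input_ids):
    """Total log-likelihood of one padded candidate, ignoring pad targets."""
    with torch.no_grad():
        logits = model(input_ids).logits
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    nll = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                   shift_labels.view(-1))
    return -nll
```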
The (development) accuracies are indeed higher across the board, but the trend generally holds (baseline < knowledge model ~ self-talk).
Task | Type | LM | Knowledge Source | Accuracy
---|---|---|---|---
COPA | Baseline | GPT2-L | - | 68.0
COPA | Ext. Knowledge | GPT2-XL | COMET | 67.0
COPA | Self-Talk | GPT2-XL | DistilGPT2 | 69.0
CommonsenseQA | Baseline | GPT2-XL | - | 36.7
CommonsenseQA | Ext. Knowledge | GPT2-L | Google Ngrams | 42.6
CommonsenseQA | Self-Talk | GPT2-XL | XLNet-B | 41.7
MC-TACO | Baseline | GPT2-XL | - | 61.2
MC-TACO | Ext. Knowledge | GPT2-XL | COMET | 63.2
MC-TACO | Self-Talk | GPT2-XL | GPT | 63.0
Social IQa | Baseline | GPT2-L | - | 43.7
Social IQa | Ext. Knowledge | GPT2-XL | ConceptNet | 46.3
Social IQa | Self-Talk | GPT2-XL | XLNet-B | 45.9
PIQA | Baseline | GPT2-XL | - | 70.2
PIQA | Ext. Knowledge | GPT2-XL | ConceptNet | 70.9
PIQA | Self-Talk | GPT2-XL | GPT2 | 71.0
WinoGrande | Baseline | GPT2-XL | - | 55.6
WinoGrande | Ext. Knowledge | GPT2-XL | COMET | 55.2
WinoGrande | Self-Talk | GPT2-XL | DistilGPT2 | 55.2
Thanks for the quick fix and for sharing the new numbers!
Sharing the results of the aforementioned project (ACL'22, to appear).
https://arxiv.org/abs/2110.08294
It would be interesting to see if there is an additive effect of coherence boosting on self-talk, with contrastive scoring to increase the effect of the self-talk explanations.
Nice, thanks for sharing your paper! It might work well when using multiple clarifications (instead of one).
I just ran a quick experiment with CommonsenseQA, using the score `(1 - alpha) * (score with clarifications) + alpha * (score without clarifications)` with `alpha < 0` to boost the self-talk effect. There indeed seems to be a benefit, with peak accuracy improving by an additional 2% over self-talk alone (43.8%).
Because contrasting with the unconditional answer likelihood (as done in our paper) gives an improvement of over 10% on this dataset, we could think of selecting three log-linear mixture weights instead of two: one for the standard query+response, one for the unconditional response, and one for the response with clarifications. It should be very quick to test out (forthcoming!).
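For concreteness, the selection rule I have in mind would look something like this sketch (the names and weight arguments are invented; each term is a log-likelihood of the candidate answer under a different conditioning context, and the weights would be tuned on the dev set):

```python
import numpy as np

def mixed_score(logp_query, logp_uncond, logp_clarified,
                w_query, w_uncond, w_clarified):
    # Log-linear mixture of three per-candidate log-likelihoods:
    # the answer scored given the standard query, given no context
    # (unconditional), and given the query plus clarifications.
    # A contrastive/boosting setup would typically put a negative
    # weight on the unconditional term.
    return (w_query * logp_query
            + w_uncond * logp_uncond
            + w_clarified * logp_clarified)

def pick_answer(candidate_scores, weights):
    # candidate_scores: one (logp_query, logp_uncond, logp_clarified)
    # tuple per answer choice; weights: (w_query, w_uncond, w_clarified).
    mixed = [mixed_score(*s, *weights) for s in candidate_scores]
    return int(np.argmax(mixed))
```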
If you'd like to discuss further, perhaps we could take it to email?
Sure, I'd love to hear more; we can discuss it over email. It would probably take me a while to find time to read the paper in depth and respond :)