kanishkamisra/minicons

IncrementalLMScorer discards probability of first token

Ubadub opened this issue · 5 comments

Consider this section of code from IncrementalLMScorer:

## Ignore the probabilities of the first token.
effective_ids = [id[1:] for id in ids]

If I'm understanding this correctly, the class is discarding the probability the model assigns to the first token in every element of a batch. I understand why such logic would make sense in the context of a model that uses a BOS token; but does this mean that this class is unusable for models that do not use a BOS token? It is not at all clear from the docs that this class is only meant to be used with BOS token models.

A BOS token is mentioned in other places in the code (for example, as a Boolean argument to prepare_text), but in those places it is clearly marked as optional, with the default being False. So I'm a little confused by the lack of such optionality in the code quoted above.

Am I understanding this correctly? If so, is there a workaround (besides reduplication of the code)?

Thanks for raising this issue! When an LM doesn't use a BOS token, it makes no sense to have a probability for the first token, since logits are only computed given some preceding context. This is the default case.

But when an LM does have a BOS token, the first token ends up being the BOS token itself, so it is the BOS token whose probability gets discarded -- in such cases you can enable the bos_token option by setting it to True.

The compute_stats function doesn't need to include this functionality, since all of this is handled by the prepare_text and prime_text functions; all compute_stats does is handle logits given some input.
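To make this concrete, here is a minimal sketch of the underlying mechanics, written with plain transformers rather than minicons internals (gpt2 and the example sentence are just placeholders): a causal LM's logits at position i predict token i + 1, so the first token of a sequence is never scored unless something, such as a BOS token, is prepended in front of it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_logprobs(ids):
    # Log-probability of each token given the tokens before it.
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]  # (seq_len, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    # Logits at position i predict token i + 1, so the first token of `ids`
    # never receives a score -- which is what dropping id[0] reflects.
    return logprobs[:-1].gather(1, ids[1:].unsqueeze(1)).squeeze(1)

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids[0]
print(len(ids), len(token_logprobs(ids)))  # n tokens, n - 1 scores

# With a BOS token prepended, every original token is conditioned on something,
# so each of them gets a score; only the BOS token itself goes unscored.
ids_bos = torch.cat([torch.tensor([tokenizer.bos_token_id]), ids])
print(len(ids), len(token_logprobs(ids_bos)))  # n tokens, n scores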

Does this make sense?

Yes, and thank you for your quick reply. I understand. Would you happen to have any advice for doing a BLiMP-type experiment with a model that does not use a BOS token? If all sentence pairs in the BLiMP corpus had an identical first word, this wouldn't matter, but for some pairs this is not the case (e.g. matrix_question_npi_licensor_present and left_branch_island_echo_question). Is the simple LM BLiMP evaluation method still meaningful in such cases?

I think you're making an important point here about the difference in first words for LMs without a BOS token! In those cases, I guess the difference in the first token would be reflected more indirectly in the log-probs assigned by the model, in the sense that p(apple | an) is likely much, much greater than p(apple | a). I am unsure about other solutions...
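For what it's worth, here is a hedged sketch of what that looks like in a minimal-pair comparison (plain transformers again, with made-up sentences standing in for a BLiMP pair whose first words differ, not actual corpus items): both sentences lose the score of their differing first token, so the comparison rests entirely on the conditional terms, which is where the difference gets reflected indirectly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence):
    # Sum of log P(token_i | tokens_<i) for i >= 2; the first token is unscored.
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs[:-1].gather(1, ids[0, 1:].unsqueeze(1)).sum().item()

# Illustrative pair (not from BLiMP): the first words differ, so neither "Should"
# nor "Derek" is scored directly, but everything conditioned on them still is.
good = "Should Derek ever leave the house?"
bad = "Derek should ever leave the house."
print(sentence_logprob(good), sentence_logprob(bad))  # higher total log-prob wins the pair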

Thanks for your very helpful responses (and the very helpful library). Yes, I did some more thinking about this and I understand your point. I'll close the issue now :)

No worries -- I still think your point holds, fwiw!