facebookresearch/MetaICL

Confusion with classification when having multi tokens

wookjeHan opened this issue · 2 comments

Hello, I am confused about how classification works when an option consists of multiple tokens.

Let's assume a classification task with two answer options, [favor, against], and the input "I do not have an opinion about this move".

If we assume a prompt template such as

Input: I do not have an opinion about this move
Output: 

my understanding is that MetaICL calculates the loss of

Input: I do not have an opinion about this move
Output: favor

and

Input: I do not have an opinion about this move
Output: against

and then averages the losses over the token positions where 'favor' and 'against' appear, and compares the two averages.
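
To make this concrete, here is a minimal sketch of the scoring scheme as I understand it (hypothetical code, not MetaICL's actual implementation; the GPT-2 checkpoint and function names are placeholders):

```python
# Hypothetical sketch of the scoring scheme described above (not MetaICL's code):
# score each option by the average cross-entropy of its tokens given the prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def option_loss(prompt, option, reduction="mean"):
    """Average (or summed) token-level loss of `option` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt).input_ids
    option_ids = tokenizer(option).input_ids
    input_ids = torch.tensor([prompt_ids + option_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so the option tokens are predicted from
    # positions len(prompt_ids)-1 .. len(prompt_ids)+len(option_ids)-2.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = range(len(prompt_ids) - 1, len(prompt_ids) + len(option_ids) - 1)
    token_losses = [-log_probs[p, t].item() for p, t in zip(positions, option_ids)]
    return sum(token_losses) / len(token_losses) if reduction == "mean" else sum(token_losses)

prompt = "Input: I do not have an opinion about this move\nOutput:"
losses = {opt: option_loss(prompt, " " + opt) for opt in ["favor", "against"]}
print(losses, "->", min(losses, key=losses.get))  # lower average loss wins
```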
However, I am wondering whether it is a fair classification method.
To illustrate, let's say that 'favor' is tokenized by ['fa', 'vor'] and 'against' is tokenized by ['against'].

Then, I think the loss for 'vor' could be heavily affected by the preceding 'fa' and may be very small.

This could lower the average loss and give 'favor' an advantage over 'against'.

I am curious to hear your thoughts.

Best regards,
Wookje Han

Hi @wookjeHan,

That's a great question. I think this could indeed be the case on some datasets.

In practice, a label that is tokenized into multiple tokens is usually a multi-word label (rather than one word split into multiple BPE tokens), e.g., non-hate -> 'non', '-', and 'hate'. So I think it is less likely that the later tokens have significantly lower losses than the first.

I just looked at the 20 evaluation datasets from the class_to_class split. 10 of the 20 contain at least one label that is split into multiple BPE tokens, and 7 of those 10 have multi-word labels. (The other three have 'entailed' or 'entailment', which are apparently split into multiple tokens.)

(Also note that in your example, 'favor' won't actually be split into multiple tokens, because the tokenizer treats ' favor' (with a leading space, as it appears after "Output:") as a word rather than 'favor'. 'favor' without the leading space is split into 2 tokens, but ' favor' is one token.)
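
A quick way to check this yourself (the exact splits depend on the tokenizer and its version, so treat this as a sketch):

```python
# Sketch: inspect how GPT-2 BPE splits labels with and without a leading space.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
for text in ["favor", " favor", " against", " non-hate", " entailment"]:
    print(repr(text), "->", tokenizer.tokenize(text))
```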

I guess another option is to consider the sum of the losses assigned to each token rather than the average. This paper explored that, but found that, on average, using the average is better than the sum. So we stuck with the average.
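
To make the average-vs-sum trade-off concrete, here is a toy illustration with made-up per-token losses (hypothetical numbers, purely for illustration):

```python
# Hypothetical per-token losses (not measured values), illustrating how the
# average and the sum can rank a 2-token label and a 1-token label differently.
favor_losses = [3.0, 0.2]   # e.g., ['fa', 'vor']: second token cheap given the first
against_losses = [2.0]      # e.g., ['against']: a single token

avg = lambda xs: sum(xs) / len(xs)
print("average:", avg(favor_losses), "vs", avg(against_losses))  # 1.6 vs 2.0 -> 'favor' wins
print("sum:    ", sum(favor_losses), "vs", sum(against_losses))  # 3.2 vs 2.0 -> 'against' wins
```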

Hi @shmsw25
Thanks for your kind reply.
Maybe calibrating the probabilities of later tokens by accounting for the preceding tokens could improve classification performance!
Thank you!