Confusion about classification with multi-token labels
wookjeHan opened this issue · 2 comments
Hello, I am confused about how classification is done when an option's label consists of multiple tokens. Let's assume a classification task with two answer options, `favor` and `against`, and the input "I do not have an opinion about this move".
If we assume a prompt template of

```
Input: I do not have an opinion about this move
Output:
```

my understanding is that MetaICL computes the loss of

```
Input: I do not have an opinion about this move
Output: favor
```

and of

```
Input: I do not have an opinion about this move
Output: against
```

averages the per-token losses over the positions where `favor` and `against` appear, and compares these average losses.
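Loosely, this scoring scheme can be sketched as follows (a minimal sketch with invented per-token losses; in practice the losses would be a causal LM's cross-entropy at the label positions):

```python
# Minimal sketch of scoring options by average label loss. The per-token
# losses here are invented numbers standing in for a causal LM's
# cross-entropy at the positions where each option's label tokens appear.

def predict(option_token_losses):
    """Return the option whose label tokens have the lowest average loss."""
    return min(
        option_token_losses,
        key=lambda opt: sum(option_token_losses[opt]) / len(option_token_losses[opt]),
    )

losses = {
    "favor": [4.0, 0.5],  # e.g. tokenized as ['fa', 'vor']
    "against": [3.0],     # a single-token label
}
print(predict(losses))  # -> favor  (average 2.25 vs. 3.0)
```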
However, I wonder whether this is a fair classification method. To illustrate, suppose `favor` is tokenized into `['fa', 'vor']` while `against` is tokenized into `['against']`. Then the loss on `'vor'` could be heavily reduced by conditioning on the preceding `'fa'`, and may be significantly small. This lowers the average loss and gives `favor` an advantage over `against`.
I am curious about your comments.
Best regards,
Wookje Han
Hi @wookjeHan,
That's a great question. I think this could indeed happen on some datasets.
In practice, a label that is tokenized into multiple tokens is usually a multi-word label (rather than one word split into multiple BPE tokens), e.g., `non-hate` -> `non`, `-` and `hate`. So I think it is less likely that the later tokens have significantly lower losses than the first one.
I just looked at the 20 evaluation datasets from the `class_to_class` split. 10 of the 20 contain at least one label that splits into multiple BPE tokens, and 7 of those 10 have multi-word labels. (The other three have `entailed` or `entailment`, which apparently split into multiple tokens.)
(Also note that in your example, `favor` won't actually split into multiple tokens, because the tokenizer considers ` favor` (with a leading space) a word rather than `favor`. `favor` is split into 2 tokens, but ` favor` is one token.)
I guess another option is to consider the sum of the losses assigned to each token, rather than the average. This paper explored that, but found that on average, using the average works better than the sum. So we stuck with using the average.
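A toy comparison (with invented per-token losses) shows how the two scoring rules can flip the prediction, since summing implicitly penalizes longer labels:

```python
# Invented per-token losses: a two-token label vs. a one-token label.
losses = {"favor": [4.0, 0.5], "against": [3.0]}

avg_pick = min(losses, key=lambda o: sum(losses[o]) / len(losses[o]))
sum_pick = min(losses, key=lambda o: sum(losses[o]))

print(avg_pick)  # -> favor    (2.25 < 3.0)
print(sum_pick)  # -> against  (4.5 > 3.0)
```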