Multi-sign evaluation

Question

Opened this issue a year ago · 0 comments

There are use cases where both the hypothesis and the reference consist of multiple signs.
This happens when we want to transcribe multiple signs continuously, or even translate whole sentences.

For these cases, our BLEU and chrF metrics work out of the box, but our Similarity and CLIPScore metrics do not.

I propose that each metric take an order_penalty parameter at initialization: 0 means reordering is not penalized, and 1 means the maximum penalty.

When the score method receives a sequence of signs:

If order_penalty == 1, we score each sign in the hypothesis against the sign at the same position in the reference and average the results. If the lengths differ, we pad the missing positions with scores of 0.
If order_penalty < 1, we score each sign in the hypothesis against every sign in the reference (score_all). We then weight the resulting matrix so that the diagonal is unchanged, while entries whose indexes are farther apart, (1,3) for example, are penalized more. Finally, we find the optimal matching and average the results.
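
The two cases above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing API: multi_sign_score, score_pair and exact_match are hypothetical names, the geometric weight (1 - order_penalty) ** |i - j| is just one possible way to leave the diagonal untouched while punishing distant pairs more, and the optimal matching is found by brute force over permutations, which is only practical for short sequences. For longer ones, something like scipy.optimize.linear_sum_assignment would be the natural replacement.

```python
from itertools import permutations
from typing import Any, Callable, Sequence


def multi_sign_score(
    hypothesis: Sequence[Any],
    reference: Sequence[Any],
    score_pair: Callable[[Any, Any], float],
    order_penalty: float = 1.0,
) -> float:
    """Hypothetical helper: score two sign sequences with a reordering penalty."""
    if not 0.0 <= order_penalty <= 1.0:
        raise ValueError("order_penalty must be in [0, 1]")

    n = max(len(hypothesis), len(reference))
    if n == 0:
        return 0.0  # assumption: two empty sequences score 0

    # Pairwise scores ("score_all"), padded with 0 where one sequence is shorter.
    matrix = [
        [
            score_pair(hypothesis[i], reference[j])
            if i < len(hypothesis) and j < len(reference)
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]

    if order_penalty == 1:
        # Position-wise comparison only: average of the diagonal.
        return sum(matrix[i][i] for i in range(n)) / n

    # Assumed weighting: the diagonal keeps weight 1, and the weight
    # shrinks geometrically as the two indexes move apart.
    keep = 1.0 - order_penalty
    weighted = [
        [matrix[i][j] * keep ** abs(i - j) for j in range(n)] for i in range(n)
    ]

    # Brute-force optimal one-to-one matching (factorial time, small n only).
    best = max(
        sum(weighted[i][p[i]] for i in range(n)) for p in permutations(range(n))
    )
    return best / n


def exact_match(a: Any, b: Any) -> float:
    """Toy pairwise metric standing in for Similarity or CLIPScore."""
    return 1.0 if a == b else 0.0


swapped = multi_sign_score(["B", "A"], ["A", "B"], exact_match, order_penalty=0.5)
```

With this toy metric, two swapped signs score 0.0 at order_penalty=1, 0.5 at order_penalty=0.5, and 1.0 at order_penalty=0.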

The order_penalty == 1 case fits transcription, where reordering the signs makes no sense. Lower values are meant for translation, where some reordering is possible.