hsiehjackson/RULER

Why do you use the partial-match (max) metric for QA?

vkaul11 opened this issue · 1 comments

Just wanted to know why we have https://github.com/hsiehjackson/RULER/blob/main/scripts/eval/synthetic/constants.py#L25
Why is this different from string_match_all, for QA specifically? Basically, if the prediction matches any one of the references, is that OK? I didn't quite understand this.

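def string_match_part(preds, refs):
    # Per sample: 1.0 if any reference is a case-insensitive substring of the prediction, else 0.0.
    hits = [max(1.0 if r.lower() in pred.lower() else 0.0 for r in ref) for pred, ref in zip(preds, refs)]
    score = sum(hits) / len(preds) * 100
    return round(score, 2)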

string_match_part gives a sample full credit when the prediction contains at least one of its references; string_match_all gives full credit only when the prediction contains all of them. We use string_match_part for QA tasks because most of the references are paraphrases of the same answer, so matching any one of them is enough.
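To make the difference concrete, here is a minimal sketch that scores one QA-style sample with both metrics. `string_match_part` is repeated so the snippet is self-contained. The `string_match_all` shown here is an assumption reconstructed from the description above (the fraction of references found in each prediction, averaged over samples) and may not match the repository's code line for line. The sample prediction and references are made up for illustration.

```python
def string_match_part(preds, refs):
    # Per sample: 1.0 if ANY reference is a case-insensitive substring of the prediction.
    hits = [
        max(1.0 if r.lower() in pred.lower() else 0.0 for r in ref)
        for pred, ref in zip(preds, refs)
    ]
    return round(sum(hits) / len(preds) * 100, 2)


def string_match_all(preds, refs):
    # Assumed behaviour: per sample, the fraction of references found in the
    # prediction, so 100 requires every reference to match.
    fracs = [
        sum(1.0 if r.lower() in pred.lower() else 0.0 for r in ref) / len(ref)
        for pred, ref in zip(preds, refs)
    ]
    return round(sum(fracs) / len(preds) * 100, 2)


# Hypothetical QA sample: two paraphrased references for the same answer.
preds = ["The capital of France is Paris."]
refs = [["Paris", "the city of Paris"]]

part_score = string_match_part(preds, refs)
all_score = string_match_all(preds, refs)
print(part_score, all_score)
```

Under these assumptions the prediction contains only the first paraphrase, so the partial-match metric treats the answer as fully correct, while the all-match metric penalizes it for not also containing the second wording.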