web-arena-x/webarena

Fuzzy match gives the wrong answer in eval


For task 361, our agent gave the answer:

Order number 170 is Canceled, order number 189 is Pending

The evaluator uses fuzzy match and scored this answer as wrong. The relevant part of the task config:

        "eval_types": [
            "string_match"
        ],
        "reference_answers": {
            "fuzzy_match": [
                "170: cancelled",
                "189: pending"
            ]
        },
        "reference_url": "",
        "program_html": [],
        "string_note": "",
        "reference_answer_raw_annotation": "170: cancelled, 189: pending"
    },

In my opinion, line 165 of 'StringEvaluator' in evaluation_harness.evaluator_router should be revised to:

assert isinstance(value, list)
score *= self.fuzzy_match(
    ref=" ".join(value), pred=pred, intent=intent
)

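The original code compares each item in the 'fuzzy_match' list with the prediction separately and multiplies the scores, so an answer that covers both orders is judged against "170: cancelled" alone and then against "189: pending" alone. The prediction should instead be compared once against the whole list.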
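To make the difference concrete, here is a minimal sketch of the two scoring strategies. It does not use the real evaluator: `score_per_item`, `score_joined`, and `make_recording_judge` are hypothetical names, and the judge is a stand-in that always returns 1.0 and only records which reference strings it is asked about. The actual fuzzy matcher may behave differently.

```python
from typing import Callable, List

# Hypothetical stand-in type for the fuzzy matcher: (reference, prediction) -> score.
Judge = Callable[[str, str], float]


def score_per_item(refs: List[str], pred: str, judge: Judge) -> float:
    """Judge the prediction against each reference item alone and multiply the scores."""
    score = 1.0
    for ref in refs:
        score *= judge(ref, pred)
    return score


def score_joined(refs: List[str], pred: str, judge: Judge) -> float:
    """Judge the prediction once against all reference items joined into one string."""
    return judge(" ".join(refs), pred)


def make_recording_judge(seen: List[str]) -> Judge:
    """Build a dummy judge that logs each reference it receives (assumption: not the real matcher)."""
    def judge(ref: str, pred: str) -> float:
        seen.append(ref)
        return 1.0
    return judge


refs = ["170: cancelled", "189: pending"]
pred = "Order number 170 is Canceled, order number 189 is Pending"

per_item_refs: List[str] = []
score_per_item(refs, pred, make_recording_judge(per_item_refs))

joined_refs: List[str] = []
score_joined(refs, pred, make_recording_judge(joined_refs))

print(per_item_refs)  # ['170: cancelled', '189: pending']
print(joined_refs)    # ['170: cancelled 189: pending']
```

With per-item scoring the judge sees two partial references, and a single low score zeroes out the product. With the joined reference it sees the full expected answer in one call.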