web-arena-x/visualwebarena

Assertion error in LLM-based fuzzy match


For config_files/test_reddit/69.json, I get the following error in the LLM-based fuzzy match metric.

[Unhandled Error] AssertionError('n/a')
Traceback (most recent call last):
  File "/home/pahuja.9/visualwebarena/run.py", line 412, in test
    score = evaluator(
  File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 626, in __call__
    cur_score = evaluator(trajectory, config_file, page, client)
  File "<@beartype(evaluation_harness.evaluators.HTMLContentExactEvaluator.__call__) at 0x7f992c464790>", line 115, in __call__
  File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 472, in __call__
    StringEvaluator.fuzzy_match(
  File "<@beartype(evaluation_harness.evaluators.StringEvaluator.fuzzy_match) at 0x7f992c453e20>", line 69, in fuzzy_match
  File "/home/pahuja.9/visualwebarena/evaluation_harness/evaluators.py", line 197, in fuzzy_match
    return llm_fuzzy_match(pred, ref, intent)
  File "<@beartype(evaluation_harness.helper_functions.llm_fuzzy_match) at 0x7f992c452cb0>", line 69, in llm_fuzzy_match
  File "/home/pahuja.9/visualwebarena/evaluation_harness/helper_functions.py", line 609, in llm_fuzzy_match
    assert "correct" in response, response
AssertionError: n/a

I am using the same LLM for the fuzzy match as in the original code.

This happens sometimes if gpt-4 returns something other than one of ["correct", "partially correct", "incorrect"]. It can be safely ignored, since the task would have failed anyway (the LLM didn't return "correct"). You could also replace the assert with another elif to assign a score of 0 if you prefer.
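
For illustration, here is a minimal sketch of that fallback as a standalone helper. The function name is hypothetical and the exact structure of llm_fuzzy_match in evaluation_harness/helper_functions.py may differ; only the assert shown in the traceback and the three expected reply strings are taken from the code above.

```python
def score_fuzzy_match_response(response: str) -> float:
    """Map the judge LLM's raw reply to a score.

    Instead of `assert "correct" in response, response`, any
    unexpected reply (e.g. "n/a") is treated as a failed match.
    """
    response = response.lower().strip()
    # Check the negative labels first, since "partially correct"
    # and "incorrect" both contain the substring "correct".
    if "partially correct" in response or "incorrect" in response:
        return 0.0
    if "correct" in response:
        return 1.0
    # Fallback replacing the assert: unrecognized reply scores 0.
    return 0.0
```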