WHGTyen/BIG-Bench-Mistake

`target` != `answer` but `mistake_index` is `None`


I found 18 samples in multistep_arithmetic.jsonl where `sample["target"] != sample["answer"]` but `sample["mistake_index"]` is `None`.

Here are two examples:

{'input': '(((-5 + -9 - 0 + 8) + (5 + 2 + 2 - -9)) - ((-6 + 5 - -9 * -8) + (2 - 0 - 5 + 0))) =', 'steps': ['This equation can be written as "(A + B) - (C + D)", where A = ((-5 + -9 - 0 + 8) + (5 + 2 + 2 - -9)) and B = ((-6 + 5 - -9 * -8) + (2 - 0 - 5 + 0)).', "Let's calculate A = ((-5 + -9 - 0 + 8) + (5 + 2 + 2 - -9)) = ((-5 - 9 + 8) + (5 + 2 + 2 + 9)) = (-6) + (18) = 12.", "Let's calculate B = ((-6 + 5 - -9 * -8) + (2 - 0 - 5 + 0)) = ((-6 + 5 - 72) + (2 - 5)) = (-67) + (-3) = -70.", 'Then, the final equation is (A + B) - (C + D) = (12 + -70) - (-70) = -58 - (-70) = 12. So the answer is 12'], 'target': '88', 'answer': '12', 'mistake_index': None}

In the example above, the 0th step should be labelled incorrect: A should be (-5 + -9 - 0 + 8) rather than ((-5 + -9 - 0 + 8) + (5 + 2 + 2 - -9)), and B should be (5 + 2 + 2 - -9) rather than ((-6 + 5 - -9 * -8) + (2 - 0 - 5 + 0)).

{'input': '(((-3 + -2 * 5 + 3) - (-5 + -3 - -8 * -3)) - ((-2 + 8 - 6 - 2) + (-5 - -4 * -3 - 8))) =', 'steps': ['This equation can be written as "A - B - C - D", where A = (-3 + -2 * 5 + 3), B = (-5 + -3 - -8 * -3), C = (-2 + 8 - 6 - 2), and D = (-5 - -4 * -3 - 8).', "Let's calculate A = (-3 + -2 * 5 + 3) = (-3 + (-2 * 5) + 3) = (-3 + (-10) + 3) = (-3 - 10 + 3) = -10.", "Let's calculate B = (-5 + -3 - -8 * -3) = (-5 + -3 - (-8 * -3)) = (-5 + -3 - (24)) = (-5 + -3 - 24) = -32.", "Let's calculate C = (-2 + 8 - 6 - 2) = (-2 + 8 - 6 - 2) = (8 - 10) = -2.", "Let's calculate D = (-5 - -4 * -3 - 8) = (-5 - (-4 * -3) - 8) = (-5 - (12) - 8) = (-5 - 12 - 8) = -25.", 'Then, the final equation is A - B - C - D = -10 - (-32) - (-2) - (-25) = 15. So the answer is 15'], 'target': '49', 'answer': '15', 'mistake_index': None}

In the example above, the last step should be labelled incorrect, because -10 - (-32) - (-2) - (-25) equals 49, not 15.
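For reference, the arithmetic in both samples can be checked directly in Python, since the `input` strings (minus the trailing "=") are valid Python expressions:

```python
# The two quoted inputs, with the trailing "=" dropped.
expr1 = "(((-5 + -9 - 0 + 8) + (5 + 2 + 2 - -9)) - ((-6 + 5 - -9 * -8) + (2 - 0 - 5 + 0)))"
expr2 = "(((-3 + -2 * 5 + 3) - (-5 + -3 - -8 * -3)) - ((-2 + 8 - 6 - 2) + (-5 - -4 * -3 - 8)))"

# Both inputs evaluate to the stated targets...
assert eval(expr1) == 88
assert eval(expr2) == 49

# ...but the final step of the second trace gives 49, not the claimed 15.
assert -10 - (-32) - (-2) - (-25) == 49
```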

Are there labeling errors in these samples?

Hello, apologies for not seeing this before the holiday break. The data for Multistep Arithmetic is entirely human annotated, so there will be some errors, and this appears to be one such case, unfortunately. We also noticed a lower agreement score (see Table 2 in our paper) for Multistep Arithmetic compared to Tracking Shuffled Objects and Logical Deduction, likely because the task requires careful attention to detail and leaves more room for error.

Closing for now, please let me know if you have further questions.