night-chen/ToolQA

About the evaluation


Hello, we are trying to replicate your work, but we have not been able to reproduce the results reported in your paper on the GSM8K easy dataset. Reviewing the output files, we noticed that the evaluation metric is exact match (EM), so a prediction that differs from the reference only in format, such as 4 versus 4.0 or 50$ versus 50, is counted as incorrect. Did you encounter these cases in your work?
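
To make the discrepancy concrete, here is a minimal sketch of strict string matching next to a more lenient numeric comparison. The function names `strict_em`, `relaxed_em`, and `normalize_answer` are hypothetical and are not taken from the ToolQA code; the normalization rules (dropping whitespace, commas, and dollar signs, then comparing numerically) are an assumption about what a tolerant metric might do.

```python
import re
from decimal import Decimal, InvalidOperation


def normalize_answer(text):
    """Hypothetical helper: reduce an answer to a comparable form.

    Removes whitespace, thousands separators, and dollar signs, then
    parses the remainder as a number. Non-numeric answers fall back to
    a lowercased string.
    """
    cleaned = re.sub(r"[\s,$]", "", str(text))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return cleaned.lower()


def strict_em(pred, gold):
    """Exact match on the raw strings (only outer whitespace is ignored)."""
    return str(pred).strip() == str(gold).strip()


def relaxed_em(pred, gold):
    """Match after normalization, so 4 and 4.0 compare as equal."""
    return normalize_answer(pred) == normalize_answer(gold)


if __name__ == "__main__":
    for pred, gold in [("4.0", "4"), ("50$", "50"), ("5", "50")]:
        print(pred, gold, strict_em(pred, gold), relaxed_em(pred, gold))
```

Under this sketch, the two pairs mentioned above fail the strict comparison but pass the relaxed one, while genuinely different values still fail both.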