About the review metrics

Question

Closed this issue 3 months ago · 1 comment

I would like to ask the authors the following questions. Thank you very much for your answers :-)

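The paper aims to evaluate each aspect of tool-use capability separately, so why are REASON, RETRIEVE, and UNDERSTAND evaluated together? (ref. C.3)
For review, the paper states: “Given a thought t_i and a tool response o_i, the LLM is required to evaluate the tool response.” Doesn't this make the judgment unreliable? For example:
When the thought does not meet the requirements of the human query but the tool response does meet the requirements of the thought, the case is judged as Success, even though it is actually not a Success (because the human query has not been solved).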
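
To make the concern concrete, here is a minimal sketch of the failure case. It is not the benchmark's actual implementation: the class, the function names, and the example query are all hypothetical, and the two boolean fields simply stand in for judgments that an LLM reviewer would make.

```python
from dataclasses import dataclass


@dataclass
class ReviewCase:
    """One review sample. All fields are hypothetical, for illustration only."""
    query: str      # original human query
    thought: str    # the model's thought t_i
    response: str   # the tool response o_i
    # Assumed annotations standing in for an LLM judge's decisions.
    thought_matches_query: bool
    response_matches_thought: bool


def review_thought_only(case: ReviewCase) -> str:
    # Judges the tool response against the thought alone,
    # as the quoted sentence describes.
    return "Success" if case.response_matches_thought else "Fail"


def review_with_query(case: ReviewCase) -> str:
    # Alternative: also require that the thought serves the human query.
    ok = case.thought_matches_query and case.response_matches_thought
    return "Success" if ok else "Fail"


case = ReviewCase(
    query="Book a flight from Beijing to Shanghai",
    thought="Search for hotels in Shanghai",
    response="Found 3 hotels in Shanghai",
    thought_matches_query=False,    # the thought drifts away from the query
    response_matches_thought=True,  # the tool did what the thought asked
)

print(review_thought_only(case))
print(review_with_query(case))
```

Under these assumptions the two reviewers disagree on the same sample, which is the gap I am asking about.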

Answer 1 · 2024-01-15T02:30:39.000Z