THUNLP-MT/StableToolBench

Correctness of API Simulator

xuanz20 opened this issue · 1 comment

I read the paper, and it says that in the virtual API system, when a real API (RapidAPI) is not available, the system calls a simulated API. How do you ensure the correctness of the results from the simulated API?
Or do we actually want a result that merely "looks" right, so that we can evaluate the LLM's ability to use tools without caring about the correctness of the result?

Thank you for your question.

We believe that exactly replicating real API outputs is not necessary for an API simulator; rather, the focus should be on providing rational responses. For instance, when queried for today's weather, an API simulator need not fetch the actual temperature; instead, it should produce a plausible temperature value. The term "correctness" may not be entirely appropriate here, since any reasonable temperature can be deemed correct.
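To make the fallback concrete, here is a minimal sketch of the "real API first, simulate on failure" pattern the paper describes. It is not StableToolBench's actual implementation: the `call_api` function, the prompt text, and the model name are hypothetical, and only the `requests` and `openai` client calls are real library APIs.

```python
import json
import requests
from openai import OpenAI  # assumes the official openai>=1.0 client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical simulator prompt: ask for a *plausible* response, not a factual one.
SIMULATOR_PROMPT = (
    "You are simulating the API `{endpoint}`. Given the request arguments, "
    "return a plausible JSON response matching the API's documented schema. "
    "The values need not be factually accurate, only reasonable."
)

def call_api(endpoint: str, params: dict, timeout: float = 5.0) -> dict:
    """Try the real API first; fall back to an LLM-based simulation.

    Illustrative sketch only, not the repository's code.
    """
    try:
        resp = requests.get(endpoint, params=params, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except (requests.RequestException, ValueError):
        # Real API unavailable or returned malformed JSON:
        # ask the LLM to generate a plausible response instead.
        completion = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system",
                 "content": SIMULATOR_PROMPT.format(endpoint=endpoint)},
                {"role": "user", "content": json.dumps(params)},
            ],
        )
        return json.loads(completion.choices[0].message.content)
```

Under this design, a weather query that fails against RapidAPI still returns a well-formed response with a reasonable temperature, which is all the benchmark needs.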

In our paper, we conduct a "Turing test" to illustrate that outputs from LLM-based simulations are virtually indistinguishable from real API responses, and that the diversity of these simulations mirrors that of actual APIs.
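For intuition, the idea behind such a test can be sketched as a blind discrimination experiment: shuffle real and simulated responses, have a judge label each one, and check whether accuracy stays near chance. This harness is a hypothetical illustration, not the paper's exact protocol; the `judge` callable stands in for a human annotator or an LLM classifier.

```python
import random

def turing_test(real: list[str], simulated: list[str], judge) -> float:
    """Blind discrimination test over shuffled real/simulated responses.

    `judge` maps a response string to "real" or "simulated".
    Accuracy near 0.5 means the simulations are hard to distinguish
    from genuine API outputs.
    """
    samples = [(r, "real") for r in real] + [(s, "simulated") for s in simulated]
    random.shuffle(samples)  # hide the source of each response
    correct = sum(judge(text) == label for text, label in samples)
    return correct / len(samples)
```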

I concur that, as a benchmark, the priority is evaluating the tool-learning capabilities of LLMs. Nevertheless, it is also important to ensure that these simulations operate within a realistic (or nearly realistic) and reliable framework. Hence, we carefully test the verisimilitude of our simulations.