THUNLP-MT/StableToolBench

Reproduce experimental results.

Taeyoung-Jang opened this issue · 1 comments

Thank you for your great work!

I just ran the evaluation pipeline and checked the pass rates for ToolLLaMA v2, gpt-3.5-turbo, and gpt-4-turbo. However, all of the pass rates are significantly lower than the scores reported in your experiments.

I have confirmed that gpt-4-turbo is being used both on the server and during the evaluation process. Is there anything I should take into account during inference to obtain the reported results?

I am curious if there are any hyperparameters used to achieve results similar to those obtained in the experiment. (I think there can be an error margin of up to 5% in reproducing the experiment.)

Hi, thank you for your interest in this work.

We are currently experiencing two issues that may cause the reproducibility problem:

  • Firstly, the real API server maintained by the ToolBench team is unstable. Many calls to real APIs return HTTP 500 errors, as other users have reported. We are investigating this and hope to fix it soon. You can double-check your replicated trajectories to see whether you are affected by this problem.
  • Secondly, OpenAI updated their gpt-4-turbo models this month. With the new model as the evaluator, the measured performance drops systematically. We used gpt-4-turbo-preview in our experiments, but the behaviour of this model has also changed considerably. We will soon update the model performance with gpt-4-turbo-2024-04-09 and publish our model inference. We are also training our own evaluator based on an open-source model to replace these closed-source models.
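
For the first point, a quick way to check replicated trajectories is to scan the saved JSON files for strings that look like server errors. The following is only a minimal sketch: the directory layout, the assumption that trajectories are stored as JSON, and the patterns in `ERROR_PATTERNS` are all hypothetical and should be adapted to whatever your own inference output actually contains. The helper names (`iter_strings`, `count_error_strings`, `scan_directory`) are illustrative and are not part of StableToolBench.

```python
import json
import re
from pathlib import Path

# Assumed markers of a failed real-API call. These are guesses, not a
# documented format; adjust them to match your own trajectory files.
ERROR_PATTERNS = [
    re.compile(r"\b500\b"),
    re.compile(r"internal server error", re.IGNORECASE),
]


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def count_error_strings(trajectory):
    """Count string values that match at least one error pattern."""
    return sum(
        1
        for text in iter_strings(trajectory)
        if any(pattern.search(text) for pattern in ERROR_PATTERNS)
    )


def scan_directory(root):
    """Map each JSON file under root to its number of suspicious strings.

    Files that cannot be read or parsed are skipped; files with no
    matches are left out of the report.
    """
    report = {}
    for path in sorted(Path(root).rglob("*.json")):
        try:
            with open(path, encoding="utf-8") as handle:
                data = json.load(handle)
        except (OSError, json.JSONDecodeError):
            continue
        hits = count_error_strings(data)
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_directory` on the folder that holds your inference output returns a dictionary of files with at least one match. Because `\b500\b` also matches an ordinary "500" in a response body, treat the result as a list of files to inspect by hand rather than as an exact error count.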