Could you release the reproduction data for your results?
I'm testing the pass rate evaluation. Could you provide the reproduction data, as ToolBench does?
Thanks for your reply
Hi! Thank you for your interest in our work.
We are planning to publish our model inference results soon. However, OpenAI updated their gpt-4-turbo models this month, and with the new model as the evaluator, the measured performance drops systematically. We used gpt-4-turbo-preview in our experiments, but the behaviour of this model has also changed a lot. We will soon update the reported model performance using gpt-4-turbo-2024-04-09. We are also training our own evaluator based on an open-source model to replace these closed-source models.
Hi, thanks for your great work on StableToolBench. Is there any update on the release plan for the model inference results?
I'm working with StableToolBench to build a benchmark with other evaluation metrics. However, it's expensive to rerun inference for all the models. Even if the evaluation setup changes, would it be possible to release the inference results first? As far as I can tell, the inference results stay the same regardless of which evaluator is used.