night-chen/ToolQA
ToolQA is a new dataset for evaluating the capabilities of LLMs in answering challenging questions with external tools. It offers two difficulty levels (easy/hard) across eight real-life scenarios.
Jupyter Notebook, Apache-2.0
Issues
- 0
Coffee dataset
#7 opened by xschen-beb - 0
- 0
About the evaluation
#5 opened by zhangzhen-research - 1
About **Programmatic Answer Generation** Part
#4 opened by yc1999 - 1
Expecting the benchmark code!
#3 opened by hzy312 - 2
The release of tool code
#2 opened by hsaest - 1
the question and answer are mismatch
#1 opened by Ericmututu