Survey: Tool Learning with Large Language Models

Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems.

This is a collection of papers related to tool learning with LLMs, organized according to our survey paper "Tool Learning with Large Language Models: A Survey".

Chinese: We have noticed that PaperAgent and 旺知识 have provided a brief and a comprehensive introduction in Chinese, respectively. We greatly appreciate their assistance.

🎉 Our survey paper has been accepted by Frontiers of Computer Science (FCS). The latest version of our paper has already been released; please check it out!

Please feel free to contact us if you have any questions or suggestions!

Contribution

🎉👍 Please feel free to open an issue or make a pull request! 🎉👍

Citation

If you find our work helpful for your research, please kindly cite our paper:

@article{qu2025tool,
  title={Tool learning with large language models: A survey},
  author={Qu, Changle and Dai, Sunhao and Wei, Xiaochi and Cai, Hengyi and Wang, Shuaiqiang and Yin, Dawei and Xu, Jun and Wen, Ji-Rong},
  journal={Frontiers of Computer Science},
  volume={19},
  number={8},
  pages={198343},
  year={2025},
  publisher={Springer}
}

🌟 Introduction

Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive survey of existing works on tool learning with LLMs. In this survey, we review the existing literature from two primary aspects: (1) why tool learning is beneficial and (2) how tool learning is implemented, enabling a comprehensive understanding of tool learning with LLMs. We first explore the “why” by reviewing both the benefits of tool integration and the inherent benefits of the tool learning paradigm from six specific aspects. In terms of “how”, we systematically review the literature according to a taxonomy of four key stages in the tool learning workflow: task planning, tool selection, tool calling, and response generation. Additionally, we provide a detailed summary of existing benchmarks and evaluation methods, categorizing them according to their relevance to different stages. Finally, we discuss current challenges and outline potential future directions, aiming to inspire both researchers and industrial developers to further explore this emerging and promising area.

The overall workflow for tool learning with large language models.
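
The four stages named in the introduction (task planning, tool selection, tool calling, and response generation) can be illustrated with a toy pipeline. This is only a minimal sketch under loud assumptions: the `Tool` class, the two toy tools, and the rule-based stand-ins for each stage are hypothetical and do not come from the survey or from any paper listed here. In a real system, an LLM would typically perform each stage.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Tool:
    """Hypothetical tool record: a name, a description, and a callable."""
    name: str
    description: str
    run: Callable[[str], str]


def _add(argument: str) -> str:
    # Toy calculator: sums integers separated by '+', e.g. "2+3".
    return str(sum(int(term) for term in argument.split("+")))


# Hypothetical toy tool set; a real system would wrap external APIs.
TOOLS: Dict[str, Tool] = {
    "calculator": Tool("calculator", "add numbers together", _add),
    "echo": Tool("echo", "repeat text back", lambda argument: argument),
}


def _words(text: str) -> set:
    # Lowercased alphabetic tokens, used for keyword overlap.
    return set(re.findall(r"[a-z]+", text.lower()))


def plan_task(query: str) -> List[str]:
    """Stage 1, task planning: split the query into sub-tasks.
    Assumption: sub-tasks are separated by the word 'then'."""
    return [part.strip() for part in query.split(" then ") if part.strip()]


def select_tool(subtask: str) -> Tool:
    """Stage 2, tool selection: pick the tool whose description
    shares the most words with the sub-task."""
    return max(TOOLS.values(),
               key=lambda tool: len(_words(subtask) & _words(tool.description)))


def call_tool(tool: Tool, subtask: str) -> str:
    """Stage 3, tool calling: extract the argument and invoke the tool.
    Assumption: the argument is the text after the first colon."""
    argument = subtask.split(":", 1)[1].strip() if ":" in subtask else subtask
    return tool.run(argument)


def generate_response(results: List[Tuple[str, str]]) -> str:
    """Stage 4, response generation: merge tool outputs into one answer."""
    return "; ".join(f"{subtask} -> {output}" for subtask, output in results)


def answer(query: str) -> str:
    results = []
    for subtask in plan_task(query):
        tool = select_tool(subtask)
        results.append((subtask, call_tool(tool, subtask)))
    return generate_response(results)


print(answer("add numbers: 2+3 then repeat text: hello"))
```

The stages are kept as separate functions so that any one of them could be swapped for a tuning-free or tuning-based method from the lists below without touching the others.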

📄 Paper List

Why Tool Learning?

Benefit of Tools.

Knowledge Acquisition.
- Search Engine
  
  Internet-Augmented Dialogue Generation, ACL 2022. [Paper]
  
  WebGPT: Browser-assisted question-answering with human feedback, Preprint 2021. [Paper]
  
  Internet-augmented language models through few-shot prompting for open-domain question answering, Preprint 2022. [Paper]
  
  REPLUG: Retrieval-Augmented Black-Box Language Models, Preprint 2023. [Paper]
  
  Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023. [Paper]
  
  ART: Automatic multi-step reasoning and tool-use for large language models, Preprint 2023. [Paper]
  
  ToolCoder: Teach Code Generation Models to use API search tools, Preprint 2023. [Paper]
  
  CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, ICLR 2024. [Paper]
- Database & Knowledge Graph
  
  LaMDA: Language Models for Dialog Applications, Preprint 2022. [Paper]
  
  Gorilla: Large Language Model Connected with Massive APIs, NeurIPS 2024. [Paper]
  
  ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings, NeurIPS 2023. [Paper]
  
  ToolQA: A Dataset for LLM Question Answering with External Tools, NeurIPS 2023. [Paper]
  
  Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding, NeurIPS 2023. [Paper]
  
  Middleware for LLMs: Tools are Instrumental for Language Agents in Complex Environments, EMNLP 2024. [Paper]
- Weather or Map
  
  On the Tool Manipulation Capability of Open-source Large Language Models, NeurIPS 2023. [Paper]
  
  ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, Preprint 2023. [Paper]
  
  Tool Learning with Foundation Models, Preprint 2023. [Paper]
Expertise Enhancement.
- Mathematical Tools
  
  Training verifiers to solve math word problems, Preprint 2021. [Paper]
  
  MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning, Preprint 2021. [Paper]
  
  Chaining Simultaneous Thoughts for Numerical Reasoning, EMNLP 2022. [Paper]
  
  Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems, EMNLP 2023. [Paper]
  
  Solving math word problems by combining language models with symbolic solvers, NeurIPS 2023. [Paper]
  
  Evaluating and improving tool-augmented computation-intensive math reasoning, NeurIPS 2023. [Paper]
  
  ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving, ICLR 2024. [Paper]
  
  MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning, Preprint 2024. [Paper]
  
  Calc-CMU at SemEval-2024 Task 7: Pre-Calc -- Learning to Use the Calculator Improves Numeracy in Language Models, NAACL 2024. [Paper]
  
  MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents, Preprint 2024. [Paper]
- Python Interpreter
  
  PAL: Program-aided Language Models, ICML 2023. [Paper]
  
  Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, TMLR 2023. [Paper]
  
  Fact-Checking Complex Claims with Program-Guided Reasoning, ACL 2023. [Paper]
  
  Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models, NeurIPS 2023. [Paper]
  
  LeTI: Learning to Generate from Textual Interactions, NAACL 2024. [Paper]
  
  MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback, ICLR 2024. [Paper]
  
  Executable Code Actions Elicit Better LLM Agents, ICML 2024. [Paper]
  
  CodeNav: Beyond tool-use to using real-world codebases with LLM agents, Preprint 2024. [Paper]
  
  APPL: A Prompt Programming Language for Harmonious Integration of Programs and Large Language Model Prompts, Preprint 2024. [Paper]
  
  BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, Preprint 2024. [Paper]
  
  CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges, ACL 2024. [Paper]
  
  MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning, EMNLP 2024. [Paper]
  
  Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering, Preprint 2024. [Paper]
- Others
  
  MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting, ACL 2023. [Paper]
  
  ChemCrow: Augmenting large-language models with chemistry tools, Nature Machine Intelligence 2024. [Paper]
  
  A Review of Large Language Models and Autonomous Agents in Chemistry, Preprint 2024. [Paper]
  
  GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information, ISMB 2024. [Paper]
  
  Equipping Language Models with Tool Use Capability for Tabular Data Analysis in Finance, EACL 2024. [Paper]
  
  Simulating Financial Market via Large Language Model based Agents, Preprint 2024. [Paper]
  
  A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist, KDD 2024. [Paper]
  
  AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning, Preprint 2024. [Paper]
  
  SCIAGENT: Tool-augmented Language Models for Scientific Reasoning, EMNLP 2024. [Paper]
  
  MMedAgent: Learning to Use Medical Tools with Multi-modal Agent, EMNLP 2024 Findings. [Paper]
  
  Let Me Do It For You: Towards LLM Empowered Recommendation via Tool Learning, SIGIR 2024. [Paper]
  
  Domain-Specific ReAct for Physics-Integrated Iterative Modeling: A Case Study of LLM Agents for Gas Path Analysis of Gas Turbines, Preprint 2024. [Paper]
  
  WORLDAPIS: The World Is Worth How Many APIs? A Thought Experiment, ACL 2024 Workshop. [Paper]
  
  Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios, Preprint 2024. [Paper]
  
  HoneyComb: A Flexible LLM-Based Agent System for Materials Science, Preprint 2024. [Paper]
  
  MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling, NAACL 2025. [Paper]
  
  ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents, Preprint 2024. [Paper]
  
  TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling Capability of LLM, COLING 2025. [Paper]
Automation and Efficiency.
- Schedule Tools
  
  ToolQA: A Dataset for LLM Question Answering with External Tools, NeurIPS 2023. [Paper]
- Set Reminders
  
  ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, ICLR 2024. [Paper]
- Filter Emails
  
  ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, ICLR 2024. [Paper]
- Project Management
  
  ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, ICLR 2024. [Paper]
- Online Shopping Assistants
  
  WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, NeurIPS 2022. [Paper]
Interaction Enhancement.
- Multi-modal Tools
  
  ViperGPT: Visual Inference via Python Execution for Reasoning, ICCV 2023. [Paper]
  
  MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, Preprint 2023. [Paper]
  
  InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language, Preprint 2023. [Paper]
  
  AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn, Preprint 2023. [Paper]
  
  LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, Preprint 2024. [Paper]
  
  CLOVA: A closed-loop visual assistant with tool usage and update, CVPR 2024. [Paper]
  
  DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model, CVPR 2024. [Paper]
  
  MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning, Preprint 2024. [Paper]
  
  m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks, Preprint 2024. [Paper]
  
  From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis, Preprint 2024. [Paper]
- Machine Translator
  
  Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023. [Paper]
  
  Tool Learning with Foundation Models, Preprint 2023. [Paper]
- Natural Language Processing Tools
  
  HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, NeurIPS 2023. [Paper]
  
  GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension, Preprint 2023. [Paper]

Benefit of Tool Learning.

Enhanced Interpretability and User Trust.
Improved Robustness and Adaptability.

How Tool Learning?

Task Planning.

Tool-Integrated Reasoning

START: Self-taught Reasoner with Tools, Preprint 2025. [Paper]

ToolRL: Reward is All Tool Learning Needs, Preprint 2025. [Paper]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs, Preprint 2025. [Paper]

OTC: Optimal Tool Calls via Reinforcement Learning, Preprint 2025. [Paper]

Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use, Preprint 2025. [Paper]

Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning, Preprint 2025. [Paper]

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning, Preprint 2025. [Paper]

CoRT: Code-integrated Reasoning within Thinking, Preprint 2025. [Paper]

Agentic Reinforced Policy Optimization, Preprint 2025. [Paper]

AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning, Preprint 2025. [Paper]
Tuning-free Methods

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022. [Paper]

ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023. [Paper]

ART: Automatic multi-step reasoning and tool-use for large language models, Preprint 2023. [Paper]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, NeurIPS 2023. [Paper]

Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT, Preprint 2023. [Paper]

Large Language Models as Tool Makers, ICLR 2024. [Paper]

CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models, EMNLP 2023. [Paper]

ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models, EMNLP 2023. [Paper]

FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios, Preprint 2023. [Paper]

TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage, Preprint 2023. [Paper]

ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search, ICLR 2024. [Paper]

Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use, ACL 2024. [Paper]

TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks, Preprint 2024. [Paper]

SwissNYF: Tool Grounded LLM Agents for Black Box Setting, Preprint 2024. [Paper]

From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs, Preprint 2024. [Paper]

Budget-Constrained Tool Learning with Planning, ACL 2024 Findings. [Paper]

Planning and Editing What You Retrieve for Enhanced Tool Learning, NAACL 2024. [Paper]

Large Language Models Can Plan Your Travels Rigorously with Formal Verification Tools, Preprint 2024. [Paper]

Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning, Preprint 2024. [Paper]

STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making, Preprint 2024. [Paper]

Chain of Tools: Large Language Model is an Automatic Multi-tool Learner, Preprint 2024. [Paper]

Can Graph Learning Improve Planning in LLM-based Agents?, NeurIPS 2024. [Paper]

Tool-Planner: Dynamic Solution Tree Planning for Large Language Model with Tool Clustering, Preprint 2024. [Paper]

Tools Fail: Detecting Silent Errors in Faulty Tools, EMNLP 2024. [Paper]

What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks, Preprint 2024. [Paper]

Tulip Agent -- Enabling LLM-Based Agents to Solve Tasks Using Large Tool Libraries, Preprint 2024. [Paper]

Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases, Preprint 2024. [Paper]

From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions, ICLR 2025. [Paper]
Tuning-based Methods

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs, Intelligent Computing 2024. [Paper]

OpenAGI: When LLM Meets Domain Experts, NeurIPS 2023. [Paper]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, ICLR 2024. [Paper]

Toolink: Linking Toolkit Creation and Using through Chain-of-Solving on Open-Source Model, Preprint 2023. [Paper]

TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems, ICLR 2024. [Paper]

Navigating Uncertainty: Optimizing API Dependency for Hallucination Reduction in Closed-Book Question Answering, ECIR 2024. [Paper]

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent, EMNLP 2024. [Paper]

Efficient Tool Use with Chain-of-Abstraction Reasoning, Preprint 2024. [Paper]

Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models, Preprint 2024. [Paper]

A Solution-based LLM API-using Methodology for Academic Information Seeking, Preprint 2024. [Paper]

Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees, NeurIPS 2024. [Paper]

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets, Preprint 2024. [Paper]

MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation, Preprint 2024. [Paper]

ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback, EMNLP 2024. [Paper]

Learning Evolving Tools for Large Language Models, ICLR 2025. [Paper]

StepTool: A Step-grained Reinforcement Learning Framework for Tool Learning in LLMs, Preprint 2024. [Paper]

ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis, Preprint 2024. [Paper]

Meta-Reasoning Improves Tool Use in Large Language Models, Preprint 2024. [Paper]

Teaching LLMs to Refine with Tools, Preprint 2024. [Paper]

Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt, Preprint 2024. [Paper]

Boosting Tool Use of Large Language Models via Iterative Reinforced Fine-Tuning, Preprint 2025. [Paper]

Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research, Preprint 2025. [Paper]

Tool Selection.

Retriever-based Tool Selection

A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 1972. [Paper]

The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 2009. [Paper]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP 2019. [Paper]

Approximate nearest neighbor negative contrastive learning for dense text retrieval, ICLR 2021. [Paper]

Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling, SIGIR 2021. [Paper]

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval, ACL 2022. [Paper]

Unsupervised dense information retrieval with contrastive learning, Preprint 2021. [Paper]

CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets, ICLR 2024. [Paper]

ProTIP: Progressive Tool Retrieval Improves Planning, Preprint 2023. [Paper]

ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval, COLING 2024. [Paper]

Enhancing Tool Retrieval with Iterative Feedback from Large Language Models, EMNLP 2024 Findings. [Paper]

Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval, EMNLP 2024 Findings. [Paper]

Efficient and Scalable Estimation of Tool Representations in Vector Space, Preprint 2024. [Paper]

Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases, Preprint 2024. [Paper]

Towards Completeness-Oriented Tool Retrieval for Large Language Models, CIKM 2024. [Paper]

Graph RAG-Tool Fusion, Preprint 2025. [Paper]

Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models, Preprint 2025. [Paper]
LLM-based Tool Selection

On the Tool Manipulation Capability of Open-source Large Language Models, Preprint 2023. [Paper]

Making Language Models Better Tool Learners with Execution Feedback, NAACL 2024. [Paper]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, ICLR 2024. [Paper]

Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum, AAAI 2024. [Paper]

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls, Preprint 2024. [Paper]

TOOLVERIFIER: Generalization to New Tools via Self-Verification, EMNLP 2024 Findings. [Paper]

ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph, Preprint 2024. [Paper]

GeckOpt: LLM System Efficiency via Intent-Based Tool Selection, GLSVLSI 2024. [Paper]

AvaTaR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval, NeurIPS 2024. [Paper]

Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector, Preprint 2024. [Paper]

Adaptive Selection for Homogeneous Tools: An Instantiation in the RAG Scenario, EMNLP 2024 Findings. [Paper]

ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities, Preprint 2024. [Paper]

ToolGen: Unified Tool Retrieval and Calling via Generation, ICLR 2025. [Paper]

Toolken+: Improving LLM Tool Usage with Reranking and a Reject Option, Preprint 2024. [Paper]

EcoAct: Economic Agent Determines When to Register What Action, Preprint 2024. [Paper]

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation, Preprint 2024. [Paper]

Reducing Tool Hallucination via Reliability Alignment, Preprint 2024. [Paper]

TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use, Preprint 2024. [Paper]

Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model, Preprint 2025. [Paper]

Tool Unlearning for Tool-Augmented LLMs, Preprint 2025. [Paper]

PEToolLLM: Towards Personalized Tool Learning in Large Language Models, Preprint 2025. [Paper]

GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation, Preprint 2025. [Paper]

Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity, Preprint 2025. [Paper]

From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions, ICLR 2025. [Paper]

Tool Calling.

Tuning-free Methods

RestGPT: Connecting Large Language Models with Real-World RESTful APIs, Preprint 2023. [Paper]

Reverse Chain: A Generic-Rule for LLMs to Master Multi-API Planning, Preprint 2023. [Paper]

GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution, EACL 2023. [Paper]

Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models, Preprint 2023. [Paper]

ControlLLM: Augment Language Models with Tools by Searching on Graphs, Preprint 2023. [Paper]

EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction, Preprint 2024. [Paper]

Large Language Models as Zero-shot Dialogue State Tracker through Function Calling, ACL 2024. [Paper]

Concise and Precise Context Compression for Tool-Using Language Models, ACL 2024 Findings. [Paper]

AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation, ACL 2024 Findings. [Paper]
Tuning-based Methods

Gorilla: Large Language Model Connected with Massive APIs, NeurIPS 2024. [Paper]

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction, NeurIPS 2023. [Paper]

ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings, NeurIPS 2023. [Paper]

Tool-Augmented Reward Modeling, ICLR 2024. [Paper]

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, ACL 2024. [Paper]

ToolACE: Winning the Points of LLM Function Calling, ICLR 2025. [Paper]

CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance, Preprint 2024. [Paper]

Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs, EMNLP 2024. [Paper]

Asynchronous LLM Function Calling, Preprint 2024. [Paper]

Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation, Preprint 2025. [Paper]

Self-Training Large Language Models for Tool-Use Without Demonstrations, Preprint 2025. [Paper]

Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training, Preprint 2025. [Paper]

Response Generation.

Direct Insertion Methods

TALM: Tool Augmented Language Models, Preprint 2022. [Paper]

Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023. [Paper]

A Comprehensive Evaluation of Tool-Assisted Generation Strategies, EMNLP 2023. [Paper]
Information Integration Methods

TPE: Towards Better Compositional Reasoning over Conceptual Tools with Multi-persona Collaboration, Preprint 2023. [Paper]

RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation, ICLR 2024. [Paper]

Learning to Use Tools via Cooperative and Interactive Agents, EMNLP 2024 Findings. [Paper]

Benchmarks and Evaluation.

Benchmarks

Benchmark	Reference	Description	#Tools	#Instances	Link	Release Time
API-Bank	[Paper]	Assessing the existing LLMs’ capabilities in planning, retrieving, and calling APIs.	73	314	[Repo]	2023-04
APIBench	[Paper]	A comprehensive benchmark constructed from TorchHub, TensorHub, and HuggingFace API Model Cards.	1,645	16,450	[Repo]	2023-05
ToolBench1	[Paper]	A tool manipulation benchmark consisting of diverse software tools for real-world tasks.	232	2,746	[Repo]	2023-05
ToolAlpaca	[Paper]	Evaluating the ability of LLMs to utilize previously unseen tools without specific training.	426	3,938	[Repo]	2023-06
RestBench	[Paper]	A high-quality benchmark which consists of two real-world scenarios and human-annotated instructions with gold solution paths.	94	157	[Repo]	2023-06
ToolBench2	[Paper]	An instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT.	16,464	126,486	[Repo]	2023-07
MetaTool	[Paper]	A benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools.	199	21,127	[Repo]	2023-10
TaskBench	[Paper]	A benchmark designed to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction.	103	28,271	[Repo]	2023-11
T-Eval	[Paper]	Evaluating the tool-utilization capability step by step.	15	533	[Repo]	2023-12
ToolEyes	[Paper]	A fine-grained system tailored for the evaluation of the LLMs’ tool learning capabilities in authentic scenarios.	568	382	[Repo]	2024-01
UltraTool	[Paper]	A novel benchmark designed to improve and evaluate LLMs’ ability in tool utilization within real-world scenarios.	2,032	5,824	[Repo]	2024-01
API-BLEND	[Paper]	A large corpus for training and systematic testing of tool-augmented LLMs.	-	189,040	[Repo]	2024-02
Seal-Tools	[Paper]	Seal-Tools contains hard instances that call multiple tools to complete the job, among which some are nested tool callings.	4,076	14,076	[Repo]	2024-05
ToolQA	[Paper]	It is designed to faithfully evaluate LLMs’ ability to use external tools for question answering. (QA)	13	1,530	[Repo]	2023-06
ToolEmu	[Paper]	A framework that uses an LM to emulate tool execution and enables scalable testing of LM agents against a diverse range of tools and scenarios. (Safety)	311	144	[Repo]	2023-09
ToolTalk	[Paper]	A benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. (Conversation)	28	78	[Repo]	2023-11
VIoT	[Paper]	A benchmark that includes a training dataset and established performance metrics for 11 representative vision models, categorized into three groups using semi-automated annotations. (VIoT)	11	1,841	[Repo]	2023-12
RoTBench	[Paper]	A multi-level benchmark for evaluating the robustness of LLMs in tool learning. (Robustness)	568	105	[Repo]	2024-01
MLLM-Tool	[Paper]	A system incorporating open-source LLMs and multimodal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. (Multi-modal)	932	11,642	[Repo]	2024-01
ToolSword	[Paper]	A comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. (Safety)	100	440	[Repo]	2024-02
SciToolBench	[Paper]	Spanning five scientific domains to evaluate LLMs’ abilities with tool assistance. (Sci-Reasoning)	2,446	856	[Repo]	2024-02
InjecAgent	[Paper]	A benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. (Safety)	17	1,054	[Repo]	2024-02
StableToolBench	[Paper]	A benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. (Stable)	16,464	126,486	[Repo]	2024-03
m&m's	[Paper]	A benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, public APIs, and image processing modules. (Multi-modal)	33	4,427	[Repo]	2024-03
GeoLLM-QA	[Paper]	A novel benchmark of 1,000 diverse tasks, designed to capture complex RS workflows where LLMs handle complex data structures, nuanced reasoning, and interactions with dynamic user interfaces. (Remote Sensing)	117	1,000	[Repo]	2024-04
ToolLens	[Paper]	ToolLens includes concise yet intentionally multifaceted queries that better mimic real-world user interactions. (Tool Retrieval)	464	18,770	[Repo]	2024-05
SoAyBench	[Paper]	A Solution-based LLM API-using Methodology for Academic Information Seeking	7	792	[Repo], [HF]	2024-05
ToolBH	[Paper]	A benchmark that assesses the LLM’s hallucinations through two perspectives: depth and breadth.	-	700	[Repo]	2024-06
ShortcutsBench	[Paper]	A Large-Scale Real-world Benchmark for API-based Agents	1,414	7,627	[Repo]	2024-07
GTA	[Paper]	A Benchmark for General Tool Agents	14	229	[Repo]	2024-07
WTU-Eval	[Paper]	A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models	4	916	[Repo]	2024-07
AppWorld	[Paper]	A collection of complex everyday tasks requiring interactive coding with API calls	457	750	[Repo]	2024-07
ToolSandbox	[Paper]	A stateful, conversational and interactive tool-use benchmark.	34	1,032	[Repo]	2024-08
CToolEval	[Paper]	A benchmark designed to evaluate LLMs in the context of Chinese societal applications.	27	398	[Repo]	2024-08
NoisyToolBench	[Paper]	This benchmark includes a collection of provided APIs, ambiguous queries, anticipated questions for clarification, and the corresponding responses.	-	200	[Repo]	2024-09
NESTOOLS	[Paper]	A dataset for evaluating nested tool learning abilities of Large Language Models.	3,034	1,000	[Repo]	2024-10
MTU-Bench	[Paper]	A multi-granularity tool-Use benchmark for Large Language Models.	136	159,061	[Repo](https://github.com/MTU-Bench-Team/MTU-Bench.git)	2024-10
ACEBench	[Paper]	This system is meticulously designed to encompass a wide spectrum of function calling scenarios.	4,538	-	[Repo]	2025-01
ToolHop	[Paper]	Designed to assess LLMs’ ability to use tools in multi-hop scenarios.	3,912	995	[Repo](https://huggingface.co/datasets/bytedance-research/ToolHop)	2025-01
ToolComp	[Paper]	A comprehensive benchmark designed to evaluate multi-step tool-use reasoning.	11	485	[Repo]	2025-01
ToolRet	[Paper]	The first evaluation benchmark for tool retrieval tasks.	43,000	7,600	[Repo]	2025-03
ToolDial	[Paper]	A dataset comprising 11,111 multiturn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI.	-	11,111	[Repo]	2025-03

Evaluation

Task Planning
- Tool Usage Awareness
  
  MetaTool Benchmark: Deciding Whether to Use Tools and Which to Use, ICLR 2024. [Paper]
  
  Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?, Preprint 2024. [Paper]
- Pass Rate & Win Rate
  
  ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, ICLR 2024. [Paper]
- Accuracy
  
  T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step, ACL 2024. [Paper]
  
  RestGPT: Connecting Large Language Models with Real-World RESTful APIs, Preprint 2023. [Paper]
  
  A Solution-based LLM API-using Methodology for Academic Information Seeking, Preprint 2024. [Paper]
Tool Selection
- Precision
  
  ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents, Preprint 2024. [Paper]
- Recall
  
  Recall, precision and average precision, Department of Statistics and Actuarial Science 2004. [Paper]
- NDCG
  
  Cumulated gain-based evaluation of IR techniques, TOIS 2002. [Paper]
- COMP
  
  COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models, CIKM 2024. [Paper]
Tool Calling
- Consistent with stipulations
  
  T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step, ACL 2024. [Paper]
  
  Planning and Editing What You Retrieve for Enhanced Tool Learning, NAACL 2024. [Paper]
  
  ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios, Preprint 2024. [Paper]
  
  ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents, Preprint 2024. [Paper]
Response Generation
- BLEU
  
  Bleu: a Method for Automatic Evaluation of Machine Translation, ACL 2002. [Paper]
- ROUGE
  
  Rouge: A package for automatic evaluation of summaries, ACL 2004. [Paper]
- Exact Match
  
  cem: Coarsened exact matching in Stata, The Stata Journal 2009. [Paper]
Parameter Filling
- Precision
  
  ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents, Preprint 2024. [Paper]
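
Several of the metrics named above (Precision, Recall, NDCG, and Exact Match) have standard textbook definitions that are easy to state in code. The sketch below is a hedged illustration only: it assumes binary relevance, a ranked list of tool names, and a simple lowercase-and-strip normalization for exact match. The function names are hypothetical, and individual papers in this list may define their metrics differently.

```python
import math
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved tools that are relevant (assumes k > 0)."""
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant tools that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for tool in ranked[:k] if tool in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary relevance: DCG of the ranking divided by
    the DCG of an ideal ranking that puts all relevant tools first."""
    dcg = sum(1.0 / math.log2(position + 2)
              for position, tool in enumerate(ranked[:k]) if tool in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after lowercasing and trimming whitespace
    (the normalization is an assumption of this sketch)."""
    return prediction.strip().lower() == reference.strip().lower()


# Hypothetical example: a retriever ranks four tools, two are relevant.
ranked_tools = ["search", "calculator", "weather", "translator"]
relevant_tools = {"calculator", "translator"}
print(precision_at_k(ranked_tools, relevant_tools, 3))
print(recall_at_k(ranked_tools, relevant_tools, 3))
print(ndcg_at_k(ranked_tools, relevant_tools, 3))
print(exact_match("  Paris ", "paris"))
```

Precision and recall ignore where a relevant tool sits inside the top-k, whereas NDCG discounts hits at lower ranks, which is why both kinds of metric are usually reported together for tool retrieval.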

Challenges and Future Directions

High Latency in Tool Learning
Rigorous and Comprehensive Evaluation
Comprehensive and Accessible Tools
Safe and Robust Tool Learning
Unified Tool Learning Framework
Real-World Benchmark for Tool Learning
Tool Learning with Multi-Modal

Other Resources

Awesome Lists

ToolLearningPapers. [Repo]

awesome-tool-llm. [Repo]

awesome-llm-tool-learning. [Repo]
Other Surveys

Augmented Language Models: a Survey, TMLR 2024. [Paper]

Tool Learning with Foundation Models, Preprint 2024. [Paper]

What Are Tools Anyway? A Survey from the Language Model Perspective, COLM 2024. [Paper]

quchangle1/LLM-Tool-Survey