In the era of large language models (LLMs), LLM-based agents have shown remarkable performance on several existing benchmarks and in real-world applications. Nevertheless, they still struggle with complex tasks. Inspired by collaborative problem solving, several recent works adopt multi-agent collaboration as a potential solution.
We collect must-read papers to help readers catch up on state-of-the-art methods and to facilitate related research.
- Dataset: HotpotQA, FEVER, ALFWorld, WebShop
Link: more previous works can be found in:
Thanks a lot for their pioneering effort.
- [2023/10] MetaAgents: Simulating Interactions Of Human Behaviors For LLM-Based Task-Oriented Coordination Via Collaborative Generative Agents (Lehigh University) [paper]
- task: Task-oriented Social
- [2023/10] GameGPT: Multi-agent Collaborative Framework For Game Development (AutoGame Research) [paper]
- task: Coding, Game Development, Multi-Agent cooperation
- [2023/10] Evaluating Multi-agent Coordination Abilities In Large Language Models (University of California, Santa Cruz) [paper]
- task: Multi-agent coordination, LLM-ToM-Reasoning
- [2023/10] Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models [paper] [code]
- task: Visual Semantic Navigation
- Dataset: HM3D
- [2023/10] Dynamic LLM-Agent Network: An LLM-Agent Collaboration Framework With Agent Team Optimization [paper]
- task: Arithmetic reasoning, general reasoning, code generation
- Dataset: MATH, MMLU, HumanEval
- [2023/10] Multi-agent Consensus Seeking Via Large Language Models (Westlake University) [paper]
- task: Reasoning
- [2023/10] Exploring Collaboration Mechanisms For LLM Agents: A Social Psychology View (National University of Singapore, NUS-NCS Joint Lab) [paper]
- task: Multi-agent cooperation
- Dataset: MMLU, MATH, BIG-Bench Benchmark
- [2023/10] Corex: Pushing The Boundaries Of Complex Reasoning Through Multi-Model Collaboration [paper] [code]
- task: Reasoning
- Dataset: GSM8K, MultiArith, SingleOP/SingleEQ, AddSub, AQuA, SVAMP, GSMHard, StrategyQA, CommonsenseQA, BoolQ, AI2 Reasoning Challenge (ARC-c), BigBench, FinQA, ConvFinQA, TAT-QA
- [2023/10] Language Agents With Reinforcement Learning For Strategic Play In The Werewolf Game [paper]
- task: Werewolf game
- [2023/10] AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems (Gaoling School of Artificial Intelligence, Renmin University of China) [paper]
- task: Recommendation
- Dataset: CDs and Vinyl, Office Products
- [2023/10] AgentVerse: Facilitating Multi-Agent Collaboration And Exploring Emergent Behaviors [paper] [code]
- task: Conversation, Mathematical Calculation, Logical Reasoning, Coding
- Dataset: FED, Commongen-Challenge, MGSM, BigBench, HumanEval
- [2023/10] Large Language Models Can Design Game-theoretic Objectives For Multi-Agent Planning [paper]
- task: Embodied Intelligence
- Dataset: ThreeDWorld Transport Challenge
- [2023/10] Communicative Agents For Software Development (Tsinghua University) [paper]
- task: Coding
- Dataset: Camel
- task: Math(LP)
- Dataset: LPWP, ComplexOR
- [2023/09] OKR-Agent: An Object And Key Results Driven Agent System With Hierarchical Self-Collaboration And Self-Evaluation [paper]
- task: Storyboard Generation, Creative Writing, Trip Planning
- Dataset: (case study)
- [2023/09] Reason To Behave: Achieving Human-Level Task Execution For Physics-Based Characters [paper] [code]
- task: Path Planning
- Dataset: MoCap
- [2023/09] Adapting LLM Agents Through Communication [paper]
- task: Path Planning, QA, Math reasoning
- Dataset: ALFWorld, HotpotQA, GSM8K
- task: Open-ended Question Answering, Trivia Creative Writing
- Dataset: MT-bench
- [2023/09] MetaGPT: Meta Programming For A Multi-Agent Collaborative Framework [paper]
- task: Coding
- Dataset: HumanEval, MBPP, SoftwareDev
- [2023/09] OceanGPT: A Large Language Model For Ocean Science Tasks [paper]
- task: Ocean-related Task
- Dataset: open-access literature, OCEANBENCH
- [2023/09] Playing Repeated Games With Large Language Models [paper]
- task: Cooperation and coordination games
- [2023/09] ChatEval: Towards Better LLM-Based Evaluators Through Multi-Agent Debate [paper]
- task: QA
- Dataset: FairEval, Topical-Chat
- [2023/09] MindAgent: Emergent Gaming Interaction [paper]
- task: Planning, Coordination
- Dataset: Cuisine World
- task: Planning, Conversation, Cooperation
- Dataset: ThreeDWorld Multi-Agent Transport (TDW-MAT)
- [2023/09] AutoGen: Enabling Next-Gen LLM Applications Via Multi-Agent Conversation (Microsoft Research) [paper] [code]
- task: Math, QA, Decision, Coding, Chat, Chess
- Dataset: MATH, Natural Questions, ALFWorld
- [2023/08] AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. (Microsoft Research) [paper] [code]
- task: Multi-agent Cooperation, Conversation, MMLU
- Dataset: MATH, Natural Questions, ALFWorld, OptiGuide
- [2023/08] Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. (University of Illinois Urbana-Champaign) [paper] [code]
- task: Cognitive synergy
- Dataset: BigBench, TriviaQA
- [2023/08] CGMI: Configurable General Multi-Agent Interaction Framework. (East China Normal University) [paper]
- task: Replicate human interactions in real-world scenarios
- [2023/08] ProAgent: Building Proactive Cooperative AI with Large Language Models. (Institute for Artificial Intelligence, Peking University) [paper] [code]
- task: Cooperative Reasoning, Planning
- Dataset: Overcooked-AI
- [2023/07] RoCo: Dialectic Multi-Robot Collaboration with Large Language Models. (Columbia University) [paper] [code]
- task: Communication, Path Planning, Reasoning
- Dataset: RoCoBench
- [2023/06] When Large Language Model Based Agent Meets User Behavior Analysis: A Novel User Simulation Paradigm (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China) [paper]
- task: User Simulation
- Dataset: RecAgent
- [2023/06] Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents. (University of Alberta) [paper]
- task: Multi-Agent coordination
- [2023/05] Training Socially Aligned Language Models in Simulated Human Society. (Dartmouth College) [paper] [code]
- task: Learn From Simulated Social Interactions
- Dataset: Anthropic RLHF
- [2023/05] SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks. (Allen Institute for Artificial Intelligence) [paper] [code]
- task: Reasoning, Path Planning
- Dataset: ScienceWorld
- [2023/05] ChatGPT as your Personal Data Scientist. (Auburn University) [paper]
- task: AutoML
- Dataset: UCI Machine Learning Repository, Cora
- [2023/05] Agents: An Open-source Framework for Autonomous Language Agents. (ETH Zürich) [paper] [code]
- task: Planning, Tool Usage, Multi-Agents communication
- [2023/05] Improving Factuality and Reasoning in Language Models through Multiagent Debate. (Google Brain) [paper] [code]
- task: Mathematical Reasoning, Strategic Reasoning
- Dataset: GSM8K, MMLU
- [2023/05] Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback. (University of Edinburgh) [paper] [code]
- task: Autonomously Improve
- [2023/05] Examining the Inter-Consistency of Large Language Models: An In-depth Analysis via Debate. (Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China) [paper]
- task: Multi-Agents Coordination
- Dataset: αNLI, CSQA, COPA, e-CARE, Social IQa, PIQA, StrategyQA
- [2023/01] Blind Judgement: Agent-Based Supreme Court Modelling With GPT. (McGill University) [paper]
- task: Reasoning, Prediction
- Dataset: SCDB
We gather information on commonly used datasets for reference. Please be aware that the statistics may differ slightly between dataset versions.
| Name | Task | Size | Evaluation | Paper |
|---|---|---|---|---|
| HotpotQA | open-domain QA | train/dev/test: 88k/5.6k/5.6k | Exact Match (EM) | paper |
| MMLU | multiple-choice questions | train/dev/test: 99.8k/285/1.531k | Multitask Accuracy | paper |
| MATH | reasoning | 1.25k | Exact Match (EM) | paper |
| ALFWorld | Embodied AI | 3.5k// | Generalization | paper |
| Natural Questions | QA | 30.7k//0.78k | Exact Match (EM) | paper |
| GSM8K | reasoning | 7.5k//1.062k | Exact Match (EM) | paper |
| HumanEval | coding | 164 handwritten programming questions | Correctness | paper |
| BigBench | coding | 214 tasks | Correctness, Fluency | paper |
| AI2 Reasoning Challenge | multiple-choice questions | 3.37k/0.87k/3.55k | Correctness | paper |
| MGSM | math | 8/0.25k | Exact Match (EM) | paper |
| FairEval | LLM evaluation | 80 | Accuracy (Fairness) | paper |
| MBPP | coding | 0.37k/0.09k/0.5k | Accuracy | paper |
| Topical-Chat | chat | 11k | Coherence, Knowledge grounding, Contextual relevance | paper |
| WinoGrande | choice | 9.25k/1.25k/1.77k | Accuracy | paper |
| CommonsenseQA | commonsense knowledge QA | 12k | Accuracy | paper |
| FinQA | Numerical Reasoning over Financial Data | 8.28k | Accuracy | paper |
| BoolQ | yes/no questions | 9.23k//3.27k | Accuracy | paper |
| GSMHard | math | 1.32k// | Correctness | |
| SVAMP | math | 1k | Accuracy with semantic variations | paper |
| ConvFinQA | Numerical Reasoning in Conversational Finance | 3k/0.4k/0.4k | Correctness in neural symbolic methods and prompting-based methods | paper |
| TAT-QA | Finance QA | 16k | Correctness | paper |
| MultiArith | math | 420//180 | Accuracy, Precision, Recall, and F1-score | |
| CommonGen | constrained text generation task | 67.4k/4.02k/1.5k | Coherence | paper |
| Toolbench | Tool Usage | 16k | API function call success rate | paper |
| RestBench | Resolve instructions | 157 | Understand and execute complex instructions | paper |
| ToolQA | Use external tools for question answering | 1.5k | Success rate in answering questions | paper |
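
Several rows above report Exact Match (EM). As a rough illustration only, the sketch below computes EM with a SQuAD-style normalization (lowercasing, stripping punctuation and English articles). The function names `normalize_answer`, `exact_match`, and `exact_match_score` are hypothetical, and the normalization is an assumption: the listed papers may each use their own evaluation scripts.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; individual benchmarks may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """Return True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def exact_match_score(predictions: list[str], references: list[list[str]]) -> float:
    """Return the fraction of predictions that exactly match one of their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)
```

Note that stripping punctuation also removes decimal points, so this normalization suits short textual answers; numeric benchmarks would likely need to compare extracted numbers instead.
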
- Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation
- Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game
- Welfare Diplomacy: Benchmarking Language Model Cooperation
- Rethinking the Buyer’s Inspection Paradox in Information Markets with Language Agents
- Lyfe Agents: generative agents for low-cost real-time social interactions
- SocioDojo: Building Lifelong Analytical Agents with Real-world Text and Time Series
- SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
- Theory Of Mind For Multi-agent Collaboration Via Large Language Models [paper]
- Evaluating Large Language Models at Evaluating Instruction Following [paper]
- AgentBench: Evaluating LLMs as Agents
- Identifying the Risks of LM Agents with an LM-Emulated Sandbox
- Evaluating Multi-Agent Coordination Abilities in Large Language Models
- SmartPlay: A Benchmark for LLMs as Intelligent Agents
We thank all the paper authors for their excellent work. We also extend our thanks to all contributors.
Contributing: We may have missed important works in this field; if so, please contribute to this repo! Thanks in advance for your efforts.
For any questions, feel free to contact us. We also welcome any form of collaboration.
Email: shizhl@mail.sdu.edu.cn