Papers for LLM-based Agents Collaboration

In the era of large language models (LLMs), LLM-based agents have shown remarkable performance on several existing benchmarks and in real-world applications. Nevertheless, they still struggle with complex tasks. Inspired by collaborative problem solving, several recent works explore multi-agent collaboration as a potential solution.

We collect must-read papers to help readers catch up on state-of-the-art methods and to facilitate related research.

LLM-based Agent

ReAct: Synergizing Reasoning and Acting in Language Models [paper] [code]

Dataset: HotpotQA, FEVER, ALFWorld, WebShop

Link: more previous works can be found in:

Many thanks to the authors for their pioneering efforts.

Multi-Agent Collaboration

[2023/10] MetaAgents: Simulating Interactions Of Human Behaviors For LLM-Based Task-Oriented Coordination Via Collaborative Generative Agents (Lehigh University) [paper]

task: Task-oriented Social

[2023/10] GameGPT: Multi-agent Collaborative Framework For Game Development (AutoGame Research) [paper]

task: Coding, Game Development, Multi-Agent cooperation

[2023/10] Evaluating Multi-agent Coordination Abilities In Large Language Models (University of California, Santa Cruz) [paper]

task: Multi-agent coordination, LLM-ToM-Reasoning

[2023/10] Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models [paper] [code]

task: Visual Semantic Navigation
Dataset: HM3D

[2023/10] Dynamic LLM-Agent Network: An LLM-Agent Collaboration Framework With Agent Team Optimization [paper]

task: Arithmetic reasoning, general reasoning, code generation
Dataset: MATH, MMLU, HumanEval

[2023/10] Multi-agent Consensus Seeking Via Large Language Models (Westlake University) [paper]

task: Reasoning

[2023/10] Exploring Collaboration Mechanisms For LLM Agents: A Social Psychology View (National University of Singapore, NUS-NCS Joint Lab) [paper]

task: Multi-agent cooperation
Dataset: MMLU, MATH, BIG-Bench Benchmark

[2023/10] Corex: Pushing The Boundaries Of Complex Reasoning Through Multi-Model Collaboration [paper] [code]

task: Reasoning
Dataset: GSM8K, MultiArith, SingleOP/SingleEQ, AddSub, AQuA, SVAMP, GSMHard, StrategyQA, CommonsenseQA, BoolQ, AI2 Reasoning Challenge (ARC-c), BigBench, FinQA, ConvFinQA, TAT-QA

[2023/10] Language Agents With Reinforcement Learning For Strategic Play In The Werewolf Game [paper]

task: Werewolf game

[2023/10] AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems (Gaoling School of Artificial Intelligence, Renmin University of China) [paper]

task: Recommendation
Dataset: CDs and Vinyl, Office Products

[2023/10] AgentVerse: Facilitating Multi-Agent Collaboration And Exploring Emergent Behaviors [paper] [code]

task: Conversation, Mathematical Calculation, Logical Reasoning, Coding
Dataset: FED, CommonGen-Challenge, MGSM, BigBench, HumanEval

[2023/10] Large Language Models Can Design Game-Theoretic Objectives For Multi-Agent Planning [paper]

task: Embodied Intelligence
Dataset: ThreeDWorld Transport Challenge

[2023/10] Communicative Agents For Software Development (Tsinghua University) [paper]

task: Coding
Dataset: Camel

[2023/09] Chain-Of-Experts: When LLMs Meet Complex Operations Research Problems [paper] [code]

task: Math (LP)
Dataset: LPWP, ComplexOR

[2023/09] OKR-Agent: An Object And Key Results Driven Agent System With Hierarchical Self-Collaboration And Self-Evaluation [paper]

task: Storyboard Generation, Creative Writing, Trip Planning
Dataset: (case study)

[2023/09] Reason To Behave: Achieving Human-Level Task Execution For Physics-Based Characters [paper] [code]

task: Path Planning
Dataset: MoCap

[2023/09] Adapting LLM Agents Through Communication [paper]

task: Path Planning, QA, Math reasoning
Dataset: ALFWorld, HotpotQA, GSM8K

[2023/09] AutoAgents: A Framework For Automatic Agent Generation [paper] [code]

task: Open-ended Question Answering, Trivia Creative Writing
Dataset: MT-bench

[2023/09] MetaGPT: Meta Programming For A Multi-Agent Collaborative Framework [paper]

task: Coding
Dataset: HumanEval, MBPP, SoftwareDev

[2023/09] OceanGPT: A Large Language Model For Ocean Science Tasks [paper]

task: Ocean-related tasks
Dataset: open-access literature, OCEANBENCH

[2023/09] Playing Repeated Games With Large Language Models [paper]

task: Cooperation and coordination games

[2023/09] ChatEval: Towards Better LLM-Based Evaluators Through Multi-Agent Debate [paper]

task: QA
Dataset: FairEval, Topical-Chat

[2023/09] MindAgent: Emergent Gaming Interaction [paper]

task: Planning, Coordination
Dataset: Cuisine World

[2023/09] Building Cooperative Embodied Agents Modularly With Large Language Models [paper] [code]

task: Planning, Conversation, Cooperation
Dataset: ThreeDWorld Multi-Agent Transport (TDW-MAT)

[2023/09] AutoGen: Enabling Next-Gen LLM Applications Via Multi-Agent Conversation (Microsoft Research) [paper] [code]

task: Math, QA, Decision, Coding, Chat, Chess
Dataset: MATH, Natural Questions, ALFWorld

[2023/08] AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. (Microsoft Research) [paper] [code]

task: Multi-agent Cooperation, Conversation, MMLU
Dataset: MATH, Natural Questions, ALFWorld, OptiGuide

[2023/08] Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. (University of Illinois Urbana-Champaign) [paper] [code]

task: Cognitive synergy
Dataset: BigBench, TriviaQA

[2023/08] CGMI: Configurable General Multi-Agent Interaction Framework. (East China Normal University) [paper]

task: Replicate human interactions in real-world scenarios

[2023/08] ProAgent: Building Proactive Cooperative AI with Large Language Models. (Institute for Artificial Intelligence, Peking University) [paper] [code]

task: Cooperative Reasoning, Planning
Dataset: Overcooked-AI

[2023/07] RoCo: Dialectic Multi-Robot Collaboration with Large Language Models. (Columbia University) [paper] [code]

task: Communication, Path Planning, Reasoning
Dataset: RoCoBench

[2023/06] When Large Language Model Based Agent Meets User Behavior Analysis: A Novel User Simulation Paradigm (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China) [paper]

task: User Simulation
Dataset: RecAgent

[2023/06] Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents. (University of Alberta) [paper]

task: Multi-Agent coordination

[2023/05] Training Socially Aligned Language Models in Simulated Human Society. (Dartmouth College) [paper] [code]

task: Learn From Simulated Social Interactions
Dataset: Anthropic RLHF

[2023/05] SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks. (Allen Institute for Artificial Intelligence) [paper] [code]

task: Reasoning, Path Planning
Dataset: ScienceWorld

[2023/05] ChatGPT as your Personal Data Scientist. (Auburn University) [paper]

task: AutoML
Dataset: UCI Machine Learning Repository, Cora

[2023/05] Agents: An Open-source Framework for Autonomous Language Agents. (ETH Zürich) [paper] [code]

task: Planning, Tool Usage, Multi-Agent Communication

[2023/05] Improving Factuality and Reasoning in Language Models through Multiagent Debate. (Google Brain) [paper] [code]

task: Mathematical Reasoning, Strategic Reasoning
Dataset: GSM8K, MMLU

[2023/05] Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback. (University of Edinburgh) [paper] [code]

task: Autonomously Improve

[2023/05] Examining the Inter-Consistency of Large Language Models: An In-depth Analysis via Debate. (Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China) [paper]

task: Multi-Agent Coordination
Dataset: αNLI, CSQA, COPA, e-CARE, Social IQa, PIQA, StrategyQA

[2023/01] Blind Judgement: Agent-Based Supreme Court Modelling With GPT. (McGill University) [paper]

task: Reasoning, Prediction
Dataset: SCDB

Datasets

We gather information on commonly used datasets for reference. Please be aware that the numbers may differ slightly between dataset versions.

Name (link)	Task	Number	Evaluation*	Paper
HotpotQA	open-domain QA	train/dev/test: 88k/5.6k/5.6k	Exact Match (EM)	paper
MMLU	multiple-choice questions	train/dev/test: 99.8k/285/1.531k	Multitask Accuracy	paper
MATH	reasoning	1.25k	Exact Match (EM)	paper
ALFWorld	Embodied AI	3.5k//	Generalization	paper
Natural Questions	QA	30.7k//0.78k	Exact Match (EM)	paper
GSM8K	reasoning	7.5k//1.062k	Exact Match (EM)	paper
HumanEval	coding	164 handwritten programming questions	Correctness	paper
BigBench	coding	214 tasks	Correctness, Fluency	paper
AI2 Reasoning Challenge	choice question	3.37k/0.87k/3.55k	Correctness	paper
MGSM	Math	8/0.25k	Exact Match (EM)	paper
FairEval	llm evaluation	80	Accuracy(Fairness)	paper
MBPP	coding	0.37k/0.09k/0.5k	Accuracy	paper
Topical-Chat	chat	11k	Coherence, Knowledge grounding, Contextual relevance	paper
WinoGrande	choice	9.25k/1.25k/1.77k	Accuracy	paper
CommonsenseQA	commonsense knowledge QA	12k	Accuracy	paper
FinQA	Numerical Reasoning over Financial Data	8.28k	Accuracy	paper
BoolQ	yes/no questions	9.23k//3.27k	Accuracy	paper
GSMHard	math	1.32k//	Correctness
SVAMP	math	1k	Accuracy with semantic variations	paper
ConvFinQA	Numerical Reasoning in Conversational Finance	3k/0.4k/0.4k	Correctness in neural symbolic methods and prompting-based methods	paper
TAT-QA	Finance QA	16k	Correctness	paper
MultiArith	math	420//180	Accuracy, Precision, Recall, and F1-score
common_gen	constrained text generation task	67.4k/4.02k/1.5k	Coherent	paper
Toolbench	Tool Usage	16k	API function call success rate	paper
RestBench	Resolve instructions	157	Understand and execute complex instructions	paper
ToolQA	Use external tools for question answering	1.5k	Success rate in answering questions	paper
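
Several datasets in the table above are scored with Exact Match (EM). As a rough illustration of what that metric computes, here is a minimal Python sketch. It assumes the SQuAD-style normalization that many QA benchmarks use (lowercasing, removing punctuation and English articles, collapsing whitespace); individual benchmarks may normalize differently, and the function names `normalize_answer`, `exact_match`, and `em_score` are hypothetical, not taken from any of the listed papers.

```python
import re
import string
from typing import List, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; a given benchmark may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def em_score(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    preds: List[str] = ["The Eiffel Tower.", "42", "Paris"]
    refs: List[List[str]] = [["Eiffel Tower"], ["forty-two", "42"], ["London"]]
    # Two of the three predictions match a reference after normalization.
    print(f"EM = {em_score(preds, refs):.3f}")
```

Because EM gives no partial credit, it is usually reported alongside a softer metric such as token-level F1 when answers are free-form text.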

Simulation with Multi-agent

Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation
Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game
Welfare Diplomacy: Benchmarking Language Model Cooperation
Rethinking the Buyer’s Inspection Paradox in Information Markets with Language Agents
Lyfe Agents: generative agents for low-cost real-time social interactions
SocioDojo: Building Lifelong Analytical Agents with Real-world Text and Time Series
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Evaluation

Theory Of Mind For Multi-agent Collaboration Via Large Language Models [paper]
Evaluating Large Language Models at Evaluating Instruction Following [paper]
AgentBench: Evaluating LLMs as Agents
Identifying the Risks of LM Agents with an LM-Emulated Sandbox
Evaluating Multi-Agent Coordination Abilities in Large Language Models
SmartPlay: A Benchmark for LLMs as Intelligent Agents

Acknowledgement

We thank all the paper authors for their excellent work. We also extend our thanks to all contributors.

Contributing: We may have missed important work in this field. If so, please contribute to this repo! Thanks in advance for your efforts.

Contact

If you have any questions, feel free to contact us. We also welcome any form of collaboration.

Email: shizhl@mail.sdu.edu.cn

shizhl/Multi-Agent-Papers