/agent-for-debate

Agent4Debate is a multi-agent framework that uses LLMs to reach human-level performance in competitive debate. It dynamically coordinates specialized agents to mitigate hallucinations and improve competitiveness.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Agent4Debate: Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

Update

  • 2024-10-30: Open-sourced our code.
  • 2024-08-09: Ongoing work, coming soon.

Overview of Agent4Debate

Figure 1. Overview of Agent4Debate, a dynamic multi-agent framework based on LLMs and designed to enhance their capabilities in competitive debate. Agent4Debate employs a collaborative architecture in which four specialized agents (Searcher, Analyzer, Writer, and Reviewer) dynamically interact and cooperate.

Task Definition

Our debate structure comprises three distinct stages, namely constructive arguments, rebuttals, and summary statements.

Figure 2. Competitive debate format.

Elo Ranking

We collected records of 200 debate matches covering 66 debate motions across three categories: Fact, Value, and Policy. Participants included Agent4Debate running on different foundation models, two baselines, and ten human debaters. We then computed Elo scores over all 200 matches and sub-Elo scores for each of the three debate categories.
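As a rough illustration of how Elo and per-category sub-Elo scores can be derived from match records, here is a minimal sketch of a sequential Elo update. It is not taken from the Agent4Debate code: the record layout `(pro, con, winner, category)`, the K-factor of 4, the initial rating of 1000, and the names `compute_elo` and `compute_sub_elo` are all assumptions made for the example.

```python
from collections import defaultdict


def expected_score(r_a, r_b, scale=400.0):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def compute_elo(matches, k=4.0, initial=1000.0):
    """Sequentially update ratings over (pro, con, winner, category) records.

    k and initial are assumed values, not the settings used by the authors.
    """
    ratings = defaultdict(lambda: initial)
    for pro, con, winner, _category in matches:
        # Actual score of the Pro side: win = 1, loss = 0, tie = 0.5.
        s_pro = {"pro": 1.0, "con": 0.0, "tie": 0.5}[winner]
        e_pro = expected_score(ratings[pro], ratings[con])
        delta = k * (s_pro - e_pro)
        ratings[pro] += delta
        ratings[con] -= delta  # zero-sum update
    return dict(ratings)


def compute_sub_elo(matches):
    """Compute a separate Elo table for each debate category."""
    by_category = defaultdict(list)
    for match in matches:
        by_category[match[3]].append(match)
    return {cat: compute_elo(ms) for cat, ms in by_category.items()}


# Hypothetical match records, for illustration only.
matches = [
    ("agent_a", "human", "pro", "Fact"),
    ("human", "agent_b", "con", "Policy"),
    ("agent_a", "agent_b", "tie", "Value"),
]
full = compute_elo(matches)
sub = compute_sub_elo(matches)
```

Because sequential Elo depends on match order, a real evaluation would typically fix the order or average over shuffles; the sketch does neither.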

Debatrix-Elo Ranking

Model Full Fact Policy Value
Gemini-1.5-Pro 1034.15 1154.93 1231.98 1075.30
Claude-3.5-sonnet 1032.51 1159.18 1224.19 1074.33
Qwen2-72b-Instruct 1023.31 1130.83 1179.62 1081.75
GPT-4o 1022.21 1150.14 1137.49 1069.55
Gemini-1.5-Flash 1012.45 1136.21 1156.50 1057.73
GLM-4-Air 1011.72 1155.07 1148.53 1048.42
Deepseek-chat 1004.00 1118.98 1131.16 1054.89
Claude-3.5-sonnet* 982.07 479.50 956.21 1021.44
Human 978.35 1109.73 515.57 953.05
Deepseek-Chat* 954.34 491.13 478.78 983.99

Table 3: Debatrix-Elo Ranking.

Human-Elo Ranking

Model Full Fact Policy Value
Gemini-1.5-Pro 1040.64 1110.23 1104.79 1048.10
Claude-3.5-sonnet 1031.15 1093.87 1104.44 1020.05
GPT-4o 1028.84 1086.78 1099.63 1033.09
Human 1006.46 1055.82 1030.32 1006.57
Gemini-1.5-Flash 1000.00 1037.45 997.66 1003.29
Qwen2-72b-Instruct 999.70 1041.10 976.16 1005.56
Claude-3.5-sonnet* 991.38 1023.29 968.34 997.47
GLM-4-Air 972.48 940.00 948.31 996.67
Deepseek-chat 971.94 963.05 946.30 986.79
Deepseek-Chat* 962.61 786.44 911.33 979.29

Table 4: Human-Elo Ranking.

Figure 5. Predicted Win Rates Using Elo Rankings for Model A in A vs. B Battles.
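For readers who want to see how a win rate can be predicted from two Elo ratings, the sketch below applies the standard Elo expected-score formula to three Full-column values from the Human-Elo table. The base-10 logistic with a 400-point scale is the conventional choice and is assumed here; the source does not state which parameters the figure uses, and `predicted_win_rate` is a name introduced for this example.

```python
def predicted_win_rate(rating_a, rating_b, scale=400.0):
    """Predicted probability that A beats B under the standard Elo model.

    The 400-point scale is an assumption, not a documented setting.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Full-column ratings copied from the Human-Elo table (Table 4).
human_elo_full = {
    "Gemini-1.5-Pro": 1040.64,
    "GPT-4o": 1028.84,
    "Human": 1006.46,
}

# Pairwise predicted win rates for Model A (first key) against Model B.
win_rates = {
    (a, b): predicted_win_rate(r_a, r_b)
    for a, r_a in human_elo_full.items()
    for b, r_b in human_elo_full.items()
    if a != b
}

for (a, b), p in sorted(win_rates.items()):
    print(f"{a} vs. {b}: {p:.3f}")
```

Under this model the two directions of any pairing sum to 1, and a rating gap of a few dozen points corresponds to a win rate only modestly above 50%.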

Case Study

(Due to length constraints, the references have been omitted. Translated from Chinese by Claude-3.5-sonnet. If you want to read the original Chinese version, please click here.)

Figure 4. Case study of the debate motion "Justice is nothing but interest" (Pro side) / "Justice is nothing more than interest" (Con side). The Pro side is Agent4Debate (GPT-4o); the Con side is Agent4Debate (Claude-3.5-sonnet).

Related Work

AIDebater 2024

Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM

Chatbot Arena: New models & Elo system update