AI Agents Papers

Updated biweekly.

AI Agent

AI agents can think, act, and complete tasks by themselves.
But can they really replace our jobs?

AI Agent Workflows

Paper Categories

🔥: Recommended papers
📖: Survey papers
⚖️: Benchmark papers

Agent Capabilities
- Ideation
- Planning
- Reasoning
- Profile
- Perception
- Tool Use
- Self-Correction
- Memory
- Self-Evolution
- Safety
- Agent Tuning
- Agent Evaluation
AI Agents Architecture
AI Agents Applications
GenAI Agents Presentations
- Tutorial & Lecture

References

October Highlights (Updated 26 Oct)

"ACON: Optimizing Context Compression for Long-horizon LLM Agents" [paper]
"Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models" [paper]
"ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory" [paper]
"Learning on the Job: An Experience-Driven, Self-Evolving Agent for Long-Horizon Tasks" [paper]
"Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents" [paper]
"Where LLM Agents Fail and How They Can Learn From Failures" [paper]
"AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning" [paper]
"Scientific Algorithm Discovery by Augmenting AlphaEvolve with Deep Research" [paper]
"Artificially intelligent agents in the social and behavioral sciences: A history and outlook" [paper]
"Agentic Services Computing" [paper]
"Don’t Just Fine-tune the Agent, Tune the Environment" [paper]
"Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks" [paper]
"LLM-REVal: Can We Trust LLM Reviewers Yet?" [paper]
⚖️ "Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation" [paper]
📖 "Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey" [paper]
📖 "Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents" [paper]
📖 "Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI" [paper]
"LLM Agents Beyond Utility: An Open-Ended Perspective" [paper]
"Deep Self-Evolving Reasoning" [paper]
📖 "A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications" [paper]
"BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?" [paper]

September Highlights (Updated 28 Sep)

Ideation Task

"The Need for Verification in AI-Driven Scientific Discovery" [paper]
"What Would an LLM Do? Evaluating Policymaking Capabilities of Large Language Models" [paper]
"Social World Models" [paper]
"Language Models Do Not Follow Occam’s Razor: A Benchmark for Inductive and Abductive Reasoning" [paper]
"LLM-empowered Agents Simulation Framework for Scenario Generation in Service Ecosystem Governance" [paper]
"VulAgent: A Hypothesis Validation-Based Multi-Agent System for Software Vulnerability Detection" [paper]
"Tackling One Health Risks: How Large Language Models are leveraged for Risk Negotiation and Consensus-building" [paper]
"Agents of Discovery" [paper]

Long-Horizon Task

"ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization" [paper]
"Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning" [paper]
"The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs" [paper]
"WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents" [paper]
⚖️ "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" [paper]
"Orchestrator: Active Inference for Multi-Agent Systems in Long-Horizon Tasks" [paper]

Long-Context Task

⚖️ "LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering" [paper]
"SWE-QA: Can Language Models Answer Repository-level Code Questions?" [paper]
"ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory" [paper]
"ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization" [paper]

Agent Tuning

"rStar2-Agent: Agentic Reasoning Technical Report" [paper]
📖 "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey" [paper]
"Scaling Agents via Continual Pre-training" [paper]
"Online Process Reward Learning for Agentic Reinforcement Learning" [paper]
"Tree Search for LLM Agent Reinforcement Learning" [paper]
"LIMI: Less is More for Agency" [paper]

Self-Evolving Agents

"ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution" [paper]
"Towards General Agentic Intelligence via Environment Scaling" [paper]
"Agent²: An Agent-Generates-Agent Framework for Reinforcement Learning Automation" [paper]
"Self-Improving Embodied Foundation Models" [paper]
"Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution" [paper]

Survey

📖 "LLM-based Agentic Reasoning Frameworks: A Survey from Methods to Scenarios" [paper]
📖 "Reinforcement Learning Foundations for Deep Research Systems: A Survey" [paper]
📖 "LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions" [paper]
📖 "LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines" [paper]

August Highlights

Self-Evolving Agents

"Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance" [paper]
📖 "A Comprehensive Survey of Self-Evolving AI Agents" [paper]
"HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research" [paper]
"SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience"
"SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents" [paper]
"HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents" [paper]
⚖️ "Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark" [paper]

Memory-Based LLM Agents

"Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory" [paper]
"Memp: Exploring Agent Procedural Memory" [paper]
"Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science" [paper]
"Coarse-to-Fine Grounded Memory for LLM Agent Planning" [paper]
"Learn to Memorize: Optimizing LLM-based Agents with Adaptive Memory Framework" [paper]
"Memento: Fine-tuning LLM Agents without Fine-tuning LLMs" [paper]
"Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning" [paper]

Ideation Agents

"K-Dense Analyst: Towards Fully Automated Scientific Analysis" [paper]
📖 "From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery" [paper]
"The AI Data Scientist" [paper]
"Spacer: Towards Engineered Scientific Inspiration" [paper]
"BIODISCO: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation" [paper]
"Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization" [paper]
"MK2 at PBIG Competition: A Prompt Generation Solution" [paper]

July Highlights

Agent Blueprints

"LLM Agents Are the Antidote to Walled Gardens", University of Oxford. [paper]
"Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture", State Key Laboratory. [paper]
"Aime: Towards Fully-Autonomous Multi-Agent Framework", ByteDance. [paper]
📖 "A Survey of Context Engineering for Large Language Models" [paper]
📖 "A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents" [paper]
"From Reasoning to Super-Intelligence: A Search-Theoretic Perspective", AA-I. [paper]
"Making REST APIs Agent-Ready: From OpenAPI to Model Context Protocol Servers for Tool-Augmented LLMs", University of Michigan. [paper]

Agent Applications

"Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications", Shandong University. [paper]
"Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation", Heinrich Heine University. [paper]
"Agent Ideate: A Framework for Product Idea Generation from Patents Using Agentic AI", TCS Research. [paper]
"Agent Exchange: Shaping the Future of AI Agent Economics", Shanghai Jiao Tong University. [paper]
"Evaluating LLM Agent Collusion in Double Auctions", Relativity, Stanford University, Arb Research. [paper]
"Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models", Queen’s University, IBM USA. [paper]
"CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale", Duke University, Army Research Laboratory. [paper]
"Deep Researcher with Test-Time Diffusion", Google. [paper]

Enterprise Agents

"AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents", Accenture. [paper]
"Agentic Retrieval of Topics and Insights from Earnings Calls", Bloomberg. [paper]
⚖️ "Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments", Fudan University. [paper]
"Routine: A Structural Planning Framework for LLM Agent System in Enterprise", Digital China AI Research. [paper]
"Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance", ByteDance. [paper]
"Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments", Meta. [paper]

Data Agents

"Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems", Tsinghua University. [paper]
⚖️ "DABstep: Data Agent Benchmark for Multi-step Reasoning", Adyen, Hugging Face. [paper]
📖 "Toward Real-World Table Agents: Capabilities, Workflows, and Design Principles for LLM-based Table Intelligence", Zhejiang University. [paper]

Research Agents

📖 "The Evolving Role of Large Language Models in Scientific Innovation: Evaluator, Collaborator, and Scientist", University of North Texas. [paper]
"AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench", Meta. [paper]
"Open-ended Scientific Discovery via Bayesian Surprise", Allen Institute for AI. [paper]
"Large Language Models as Innovators: A Framework to Leverage Latent Space Exploration for Novelty Discovery", Wrocław University. [paper]

Role Playing Agents

"Too Human to Model: The Uncanny Valley of LLMs in Social Simulation", Atmospheric Environmental Research. [paper]
"Do Role-Playing Agents Practice What They Preach? Belief-Behavior Consistency in LLM-Based Simulations of Human Trust", CAMEL-AI.org. [paper]
"LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra", Princeton University, Salesforce Research. [paper]
📖 "Large Language Models for Agent-Based Modelling: Current and possible uses across the modelling cycle" [paper]
"Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models", CIFAR AI Chair. [paper]

Memory

"MemOS: A Memory OS for AI System", MemTensor (Shanghai) Technology Co., Ltd. [paper]
⚖️ "Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions", UC San Diego. [paper]
"MIRIX: Multi-Agent Memory System for LLM-Based Agents", MIRIX AI. [paper]

June Highlights

Deep Research Agents

⚖️ "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents" [paper]
📖 "From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents" [paper]
📖 "Deep Research Agents: A Systematic Examination And Roadmap" [paper]
📖 "Towards AI Search Paradigm" [paper]
"Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge" [paper]
"MMSearch-R1: Incentivizing LMMs to Search" [paper]
"Towards Robust Fact-Checking: A Multi-Agent System with Advanced Evidence Retrieval" [paper]

Data Science Agents

[Jun 2025] "AUTOMIND: Adaptive Knowledgeable Agent for Automated Data Science" [paper]
📖 [Jun 2025] "Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents" [paper]
[Jun 2025] "SheetMind: An End-to-End LLM-Powered Multi-Agent Framework for Spreadsheet Automation" [paper]
[Jun 2025] "SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications" [paper]
[Jun 2025] "Towards Community-Driven Agents for Machine Learning Engineering" [paper]
[Jun 2025] "MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement" [paper]

Business Operation Agents

"Oversight Structures for Agentic AI in Public-Sector Organizations" [paper]
⚖️ "AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance" [paper]
📖 "Application-Driven Value Alignment in Agentic AI Systems: Survey and Perspectives" [paper]
"Intelligent Design 4.0: Paradigm Evolution Toward the Agentic AI Era" [paper]
"Improved LLM Agents for Financial Document Question Answering" [paper]
⚖️ "ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering" [paper]
"Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine" [paper]
"SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models" [paper]
⚖️ "SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents" [paper]
"Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents" [paper]
"AgenticControl: An Automated Control Design Framework Using Large Language Models" [paper]
📖 "A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools" [paper]

May Highlights

Inference Time Computing

📖 "A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law" [paper]
📖 "Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models" [paper]

Tool Integrated Reasoning

"Table-R1: Inference-Time Scaling for Table Reasoning" [paper]
"Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning" [paper]
"Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning" [paper]
"Agent RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving" [paper]
"Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent" [paper]
"An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents" [paper]
"Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning" [paper]
"MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning" [paper]
"EvolveSearch: An Iterative Self-Evolving Search Agent" [paper]
"VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection" [paper]
"Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning" [paper]

Self-Improvement & Self-Evolution

Metric & Reward

"RM-R1: Reward Modeling as Reasoning" [paper]
"Reward Reasoning Model" [paper]
"R3: Robust Rubric-Agnostic Reward Models" [paper]
"AutoLibra: Agent Metric Induction from Open-Ended Feedback" [paper]

Memory

"MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models (Short Version)" [paper]
"MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents" [paper]
"MARK: Memory Augmented Refinement of Knowledge" [paper]
📖 "Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions" [paper]

Skills

"Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs" [paper]
"Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution" [paper]
"Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution" [paper]

Reasoning Model

"Absolute Zero: Reinforced Self-play Reasoning with Zero Data" [paper]
"Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks" [paper]
"DEBATE, TRAIN, EVOLVE: Self-Evolution of Language Model Reasoning" [paper]
"Self Rewarding Self Improving" [paper]
"EvolveSearch: An Iterative Self-Evolving Search Agent" [paper]

(Multi) Agent Architecture

"AlphaEvolve: A coding agent for scientific and algorithmic discovery" [paper]
"Meta-Design Matters: A Self-Design Multi-Agent System" [paper]
"Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents" [paper]
"SEW: Self-Evolving Agentic Workflows for Automated Code Generation" [paper]
"Multi-Agent Collaboration via Evolving Orchestration" [paper]

Multi-Agent

📖 "Creativity in LLM-based Multi-Agent Systems: A Survey" [paper]
⚖️ "Benchmarking LLMs’ Swarm intelligence" [paper]
"Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems" [paper]
📖 "Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications" [paper]
"Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study" [paper]

Real-World Application of AI Agents

Researcher

"34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery" [paper]
"PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration" [paper]
"R&D-Agent: Automating Data-Driven AI Solution Building Through LLM-Powered Automated Research, Development, and Evolution" [paper]
📖 "From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery" [paper]
"Towards Artificial Intelligence Research Assistant for Expert-Involved Learning" [paper]

Data Scientist

"MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering" [paper]
"ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering" [paper]
"Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics" [paper]
"Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories" [paper]
"JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation" [paper]
"MLZero: A Multi-Agent System for End-to-end Machine Learning Automation" [paper]

Software Engineer

"Can Agents Fix Agent Issues?" [paper]
"Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI" [paper]

Others

"The Real Barrier to LLM Agent Usability is Agentic ROI" [paper]
📖 "A Survey on Large Language Model based Human-Agent Systems" [paper]
📖 "Vision-Language-Action Models: Concepts, Progress, Applications and Challenges" [paper]
📖 "Multi-agent Embodied AI: Advances and Future Directions" [paper]
"Efficient Agent Training for Computer Use" [paper]
⚖️ "AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios" [paper]

April Highlights

Inference Time Computing

"Inference-Time Scaling for Generalist Reward Modeling" [paper]
"Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead" [paper]
"Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection" [paper]
"Dual Engines of Thoughts: A Depth-Breadth Integration Framework for Open-Ended Analysis" [paper]
📖 "A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems" [paper]

Self-Experience-Driven Agents

"Welcome to the Era of Experience" [paper]
"SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills" [paper]
"Exploring Expert Failures Improves LLM Agent Tuning" [paper]
"Inducing Programmatic Skills for Agentic Tasks" [paper]
"Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory" [paper]
"Local Prompt Optimization" [paper]
"Revisiting Prompt Optimization with Large Reasoning Models—A Case Study on Event Extraction" [paper]
"Iterative Trajectory Exploration for Multimodal Agents" [paper]

Meta Agents

"FlowReasoner: Reinforcing Query-Level Meta-Agents" [paper]
"A Self-Improving Coding Agent" [paper]
"Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models" [paper]

Reinforcement Learning Applications for AI Agents

"ToolRL: Reward is All Tool Learning Needs" [paper]
"OTC: Optimal Tool Calls via Reinforcement Learning" [paper]
"LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities" [paper]
📖 "Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey" [paper]

Real-World Application of AI Agents

"The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search" [paper]
"UFO2: The Desktop AgentOS" [paper]
"AGENTADA: Skill-Adaptive Data Analytics for Tailored Insight Discovery" [paper]
⚖️ "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents" [paper]
"Toward Super Agent System with Hybrid AI Router" [paper]
"AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents" [paper]
[Apr 2025] "UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents" [paper]
📖 "Challenges and Paths Towards AI for Software Engineering" [paper]

Survey

📖 "Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems" [paper]
📖 "Adaptive Human-Agent Teaming: A Review of Empirical Studies from the Process Dynamics Perspective" [paper]
📖 "A Survey of AI Agent Protocols" [paper]

edwardt/ai-agent-papers
