AI Agents for Computer Use

Official repository of "A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions" (arXiv:2501.16150).

An awesome list of computer control agents (GUI automation of desktop and mobile devices) 🚀.

Please have a look at our website for more information.
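
Most of the agents listed below share the same high-level control loop: observe the current screen (as a screenshot and/or an accessibility or DOM tree), let a (vision-)language model plan the next action grounded in a concrete UI element, execute that action on the device, and repeat until the task is done. The Python sketch below is a minimal illustration of this loop, not the method of any particular paper; all names in it (Action, plan_next_action, execute, run_agent) are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """One grounded UI action. The action space varies by paper
    (clicks, typing, scrolling, app-level APIs, etc.)."""
    kind: str          # e.g. "click", "type", "scroll", "done"
    target: str = ""   # UI element id or screen coordinates the action is grounded to
    text: str = ""     # payload for "type" actions

def plan_next_action(task: str, screenshot: bytes, history: list[Action]) -> Action:
    """Hypothetical policy: typically a (vision-)language model that maps the
    task description, current observation, and action history to the next action."""
    raise NotImplementedError

def execute(action: Action) -> bytes:
    """Hypothetical controller: performs the action on the device or browser
    and returns a fresh screenshot as the next observation."""
    raise NotImplementedError

def run_agent(task: str, screenshot: bytes, max_steps: int = 20) -> list[Action]:
    """Perceive-plan-act loop with a step budget to bound runaway episodes."""
    history: list[Action] = []
    for _ in range(max_steps):
        action = plan_next_action(task, screenshot, history)
        if action.kind == "done":
            break
        screenshot = execute(action)
        history.append(action)
    return history
```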

Repository Contents

Agents

  • Abukadah et al. - Mapping Natural Language Intents to User Interfaces through Vision-Language Models
  • Bishop et al. - Latent State Estimation Helps UI Agents to Reason
  • Bonatti et al. - Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
  • Branavan et al. - Reinforcement Learning for Mapping Instructions to Actions
  • Chae et al. - Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
  • Cheng et al. - SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
  • Cho et al. - CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only
  • Deng et al. - Mind2Web: Towards a Generalist Agent for the Web
  • Deng et al. - Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
  • Deng et al. - On the Multi-turn Instruction Following for Conversational Web Agents
  • Ding et al. - MobileAgent: enhancing mobile control via human-machine interaction and SOP integration
  • Dorka et al. - Training a Vision Language Model as Smartphone Assistant
  • Fereidouni et al. - Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning
  • Furuta et al. - Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
  • Furuta et al. - Multimodal Web Navigation with Instruction-Finetuned Foundation Models
  • Gao et al. - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
  • Guan et al. - Intelligent Virtual Assistants with LLM-based Process Automation
  • Guo et al. - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion
  • Gur et al. - A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
  • Gur et al. - Environment Generation for Zero-Shot Compositional Reinforcement Learning
  • Gur et al. - Learning to Navigate the Web
  • Gur et al. - Understanding HTML with Large Language Models
  • He et al. - WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
  • Hong et al. - CogAgent: A Visual Language Model for GUI Agents
  • Humphreys et al. - A data-driven approach for learning to control computers
  • Iki et al. - Do BERTs learn to use browser user interface? Exploring multi-step tasks with unified vision-and-language BERTs
  • Jia et al. - DOM-Q-NET: Grounded RL on Structured Language
  • Kil et al. - Dual-View Visual Contextualization for Web Navigation
  • Kim et al. - Language Models can Solve Computer Tasks
  • Koh et al. - Tree Search For Language Model Agents
  • Lai et al. - AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
  • Lee et al. - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
  • Li - Learning UI Navigation through Demonstrations composed of Macro Actions
  • Li et al. - A Zero-Shot Language Agent for Computer Control with Structured Reflection
  • Li et al. - AppAgent v2: Advanced Agent for Flexible Mobile Interactions
  • Li et al. - Glider: A Reinforcement Learning Approach to Extract UI Scripts from Websites
  • Li et al. - Interactive Task Learning from GUI-Grounded Natural Language Instructions and Demonstrations
  • Li et al. - Mapping Natural Language Instructions to Mobile UI Action Sequences
  • Li et al. - On the Effects of Data Scale on Computer Control Agents
  • Li et al. - UINav: A Practical Approach to Train On-Device Automation Agents
  • Lin et al. - Automating Web-based Infrastructure Management via Contextual Imitation Learning
  • Liu et al. - Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
  • Lo et al. - Hierarchical Prompting Assists Large Language Model on Web Navigation
  • Lu et al. - GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
  • Lu et al. - OmniParser for Pure Vision Based GUI Agent
  • Lu et al. - WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
  • Lutz et al. - WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
  • Ma et al. - CoCo-Agent: Comprehensive Cognitive LLM Agent for Smartphone GUI Automation
  • Ma et al. - LASER: LLM Agent with State-Space Exploration for Web Navigation
  • Mazumder et al. - FLIN: A Flexible Natural Language Interface for Web Navigation
  • Murty et al. - BAGEL: Bootstrapping Agents by Guiding Exploration with Language
  • Nakano et al. - WebGPT: Browser-assisted question-answering with human feedback
  • Niu et al. - ScreenAgent: A Vision Language Model-driven Computer Control Agent
  • Nong et al. - MobileFlow: A Multimodal LLM For Mobile GUI Agent
  • Pan et al. - Autonomous Evaluation and Refinement of Digital Agents
  • Putta et al. - Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
  • Rahman et al. - V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
  • Rawles et al. - Android in the Wild: A Large-Scale Dataset for Android Device Control
  • Shaw et al. - From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
  • Shi et al. - World of Bits: An Open-Domain Platform for Web-Based Agents
  • Sodhi et al. - HeaP: Hierarchical Policies for Web Actions using LLMs
  • Song et al. - MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot
  • Song et al. - Navigating Interfaces with AI for Enhanced User Interaction
  • Song et al. - RestGPT: Connecting Large Language Models with Real-World RESTful APIs
  • Song et al. - VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning
  • Sun et al. - AdaPlanner: Adaptive Planning from Feedback with Language Models
  • Sun et al. - META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
  • Tao et al. - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
  • Wang et al. - Enabling Conversational Interaction with Mobile UI using Large Language Models
  • Wang et al. - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
  • Wang et al. - OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
  • Wen et al. - AutoDroid: LLM-powered Task Automation in Android
  • Wen et al. - DroidBot-GPT: GPT-powered UI Automation for Android
  • Wu et al. - MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
  • Wu et al. - OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
  • Xu et al. - Grounding Open-Domain Instructions to Automate Web Support Tasks

Datasets

  • Shi et al. - World of Bits: An Open-Domain Platform for Web-Based Agents
  • Liu et al. - Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
  • Xu et al. - Grounding Open-Domain Instructions to Automate Web Support Tasks
  • Gur et al. - Environment Generation for Zero-Shot Compositional Reinforcement Learning
  • Yao et al. - WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
  • Deng et al. - Mind2Web: Towards a Generalist Agent for the Web
  • Koroglu et al. - QBE: QLearning-Based Exploration of Android Applications
  • Rawles et al. - Android in the Wild: A Large-Scale Dataset for Android Device Control
  • Zhou et al. - WebArena: A Realistic Web Environment for Building Autonomous Agents
  • Li et al. - Mapping Natural Language Instructions to Mobile UI Action Sequences
  • Toyama et al. - AndroidEnv: A Reinforcement Learning Platform for Android
  • Burns et al. - A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
  • Xie et al. - OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  • Shvo et al. - AppBuddy: Learning to Accomplish Tasks in Mobile Apps via Reinforcement Learning
  • Sun et al. - META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
  • Liu et al. - AgentBench: Evaluating LLMs as Agents
  • Chen et al. - WebVLN: Vision-and-Language Navigation on Websites
  • Song et al. - RestGPT: Connecting Large Language Models with Real-World RESTful APIs
  • Koh et al. - VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks
  • Deng et al. - On the Multi-turn Instruction Following for Conversational Web Agents
  • Kapoor et al. - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
  • Wen et al. - Empowering LLM to use Smartphone for Intelligent Task Automation
  • Gao et al. - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
  • Niu et al. - ScreenAgent: A Vision Language Model-driven Computer Control Agent
  • Drouin et al. - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
  • Lai et al. - AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
  • Zhang et al. - Android in the Zoo: Chain-of-Action-Thought for GUI Agents
  • Chen et al. - GUICourse: From General Vision Language Models to Versatile GUI Agents
  • Guo et al. - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion
  • Venkatesh et al. - UGIF: UI Grounded Instruction Following
  • Zheng et al. - AgentStudio: A Toolkit for Building General Virtual Agents
  • Zhang et al. - Mobile-Env: An Evaluation Platform and Benchmark for LLM-GUI Interaction
  • Chen et al. - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
  • Chai et al. - AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

Citation

If you find this work helpful, please cite:

@misc{sager_acu_2025,
      title={A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions}, 
      author={Pascal J. Sager and Benjamin Meyer and Peng Yan and Rebekka von Wartburg-Kottler and Layan Etaiwi and Aref Enayati and Gabriel Nobel and Ahmed Abdulkadir and Benjamin F. Grewe and Thilo Stadelmann},
      year={2025},
      eprint={2501.16150},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2501.16150}, 
}

Website License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.