This is a vision-enabled web-browsing agent capable of controlling the mouse and keyboard.
It is built with LangGraph, which extends the LangChain Expression Language with the ability to coordinate multiple chains (or actors) across multiple steps of computation in a cyclic manner. In other words, it adds cycles to your LLM application.
Cycles are important for agent-like behaviors, where you call the LLM in a loop, asking it what action to take next.
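To make the idea of a cycle concrete, here is a minimal sketch of such a loop in plain Python. It does not use LangGraph or a real model: `run_agent_loop`, `scripted_policy`, and the `"finish"` sentinel are hypothetical names chosen for illustration, and the scripted policy stands in for the LLM call.

```python
from typing import Callable


def run_agent_loop(
    choose_action: Callable[[list[str]], str],
    tools: dict[str, Callable[[], str]],
    max_steps: int = 10,
) -> list[str]:
    """Ask the policy for the next action until it returns 'finish'."""
    history: list[str] = []
    for _ in range(max_steps):  # cap the cycle so it always terminates
        action = choose_action(history)  # stand-in for an LLM call
        if action == "finish":
            break
        observation = tools[action]()  # run the chosen tool
        history.append(f"{action} -> {observation}")
    return history


# Hypothetical scripted policy: scroll twice, then stop.
def scripted_policy(history: list[str]) -> str:
    return "scroll" if len(history) < 2 else "finish"


history = run_agent_loop(scripted_policy, {"scroll": lambda: "scrolled down"})
print(history)  # ['scroll -> scrolled down', 'scroll -> scrolled down']
```

The `max_steps` cap is a design choice for the sketch: a loop driven by model output needs some upper bound so that a policy that never finishes cannot run forever.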
The agent works by viewing annotated browser screenshots for each turn, then choosing the next step to take. The agent architecture is a basic reasoning and action (ReAct) loop. The unique aspects of this agent are:
- Its use of Set-of-Marks-like image annotations to serve as UI affordances for the agent
- Its application in the browser by using tools to control both the mouse and keyboard
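As a rough illustration of how numbered marks can act as UI affordances, the sketch below maps a model reply that names a mark to the screen coordinates a mouse tool would click. The `BBox` class, the `Click [n]` / `Type [n]; text` reply format, and the helper names are assumptions made for this example, not the agent's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class BBox:
    """One annotated element; its list index is the label drawn on the screenshot."""
    x: float
    y: float
    width: float
    height: float
    text: str = ""

    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def parse_action(reply: str) -> tuple[str, list[str]]:
    """Split a reply such as 'Type [1]; hello' into ('Type', ['1', 'hello'])."""
    match = re.match(r"\s*(\w+)\s*(.*)", reply)
    if match is None:
        raise ValueError(f"Could not parse action: {reply!r}")
    name, rest = match.groups()
    args = [part.strip().strip("[]") for part in rest.split(";") if part.strip()]
    return name, args


def resolve_click(reply: str, marks: list[BBox]) -> tuple[float, float]:
    """Turn 'Click [n]' into the center point of mark n."""
    name, args = parse_action(reply)
    if name.lower() != "click" or len(args) != 1:
        raise ValueError(f"Not a click action: {reply!r}")
    return marks[int(args[0])].center()


# Hypothetical marks for two elements on a page.
marks = [BBox(10, 20, 100, 40, "Search"), BBox(200, 20, 60, 40, "Sign in")]
target = resolve_click("Click [1]", marks)
print(target)  # (230.0, 40.0)
```

With Playwright, a click tool could then pass these coordinates to `page.mouse.click(x, y)`, and a typing tool could send text with `page.keyboard.type(text)`.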
The overall design looks like the following:
Install packages
pip install langchain langchain-core langchain_openai langgraph playwright
