An experimental script that lets an AI interact with Windows applications through natural mouse movements, keystrokes, and intelligent screen analysis. Built with LangChain and PyAutoGUI, the project automates tasks based on real-time visual and contextual information, delivering an AI-driven, pseudo-user experience.
- Natural Interaction: Simulates human-like mouse movements and typing delays (a short sketch follows this list).
- Task-Specific Execution: Analyzes the current screen and determines one actionable step for task completion.
- Incremental Task Handling: Builds upon previous actions to complete complex, multi-step tasks.
- Screenshot-Based Analysis: Uses screenshots to verify the interface state and take appropriate actions.
- Customizable Actions: Adapts actions based on contextual clues for a seamless, stepwise task approach.
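
The repository's own input helpers aren't shown in this README, but human-like input with PyAutoGUI typically looks something like the following. `human_move` and `human_type` are illustrative names of my own, not the project's API:

```python
import random
import time

import pyautogui


def human_move(x, y):
    """Move the cursor to (x, y) with jitter, easing, and a randomized
    duration instead of teleporting there instantly."""
    pyautogui.moveTo(
        x + random.randint(-3, 3),          # small positional jitter
        y + random.randint(-3, 3),
        duration=random.uniform(0.4, 1.2),  # varied travel time
        tween=pyautogui.easeInOutQuad,      # accelerate, then decelerate
    )


def human_type(text):
    """Type one character at a time with uneven inter-key delays."""
    for char in text:
        pyautogui.write(char)
        time.sleep(random.uniform(0.05, 0.2))
```

Note that `pyautogui.write` also accepts a fixed `interval=` argument; the per-character loop simply makes the delays uneven, which reads as more human.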
- Python 3.x
- LangChain Ollama: runs an Ollama LLM model locally.
- Required Libraries: `pyautogui`, `keyboard`, `pytesseract`, `numpy`, `Pillow` (plus the standard-library `random` and `time` modules)
- Clone the repository:

  ```
  git clone https://github.com/fzkhan19/ai-computer-use.git
  cd ai-computer-use
  ```
- Install Dependencies:

  ```
  pip install pyautogui keyboard pillow langchain_ollama pytesseract numpy
  ```
- Set Up LangChain Ollama Model:
  - Start the Ollama LLM server locally at `http://localhost:11434`.
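
  To confirm the model is reachable before running the program, a quick smoke test through `langchain_ollama` looks like this; the model name `llama3.2` is an assumption, substitute whichever model you have pulled:

  ```python
  from langchain_ollama import ChatOllama

  # base_url matches the local server address above; "llama3.2" is only an
  # example model name: substitute one you have pulled with `ollama pull`.
  llm = ChatOllama(model="llama3.2", base_url="http://localhost:11434")

  response = llm.invoke("Reply with the single word: ready")
  print(response.content)
  ```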
- Run the Program:

  ```
  python main.py
  ```
- The program starts listening for tasks. Type a task (e.g., "Open Notepad and type Hello").
- The AI processes the screen state and decides the next best action based on the prompt (a rough sketch of this loop follows this list).
- Press ESC to stop the program at any time.
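
`main.py` itself isn't reproduced here, but the screenshot-to-action cycle described above can be sketched roughly as follows. The prompt wording and model name are assumptions, and a real run would parse and execute the model's suggested action rather than just print it:

```python
import time

import keyboard
import pyautogui
import pytesseract
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", base_url="http://localhost:11434")

task = input("> ")  # e.g. "Open Notepad and type Hello"

while not keyboard.is_pressed("esc"):    # ESC stops the program at any time
    screenshot = pyautogui.screenshot()  # capture the current interface state
    screen_text = pytesseract.image_to_string(screenshot)  # OCR the screenshot
    prompt = (
        f"Task: {task}\n"
        f"Visible screen text: {screen_text}\n"
        "Describe exactly one next action to take toward completing the task."
    )
    action = llm.invoke(prompt).content  # ask for a single actionable step
    print(action)                        # the real script executes this step
    time.sleep(1)                        # pause before re-checking the screen
```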
> Open Notepad and type Hello
> Open Chrome and search cats
> Launch Calculator
- Integrate with additional AI models to expand task handling capability.
- Extend platform compatibility to Linux and macOS.
- Add advanced gesture-based interactions for dynamic tasks.
Contributions are welcome! Feel free to fork the repo, submit PRs, or discuss ideas in the Issues tab.