GPTComputerAutomation is a tool that leverages OpenAI's GPT-4o model to analyze screenshots and UI elements, automating mouse clicks and typing actions on your computer.
[Full demo on x.com](https://twitter.com/Charles12509909/status/1796541659628638587
- UI Element Capture: Recursively captures visible, clickable UI elements with their coordinates.
- Screenshot Functionality: Takes and encodes screenshots for analysis.
- GPT Integration: Uses GPT-4o to intelligently determine the next click or typing action based on UI analysis.
- Automation Execution: Performs the determined actions on your computer, mimicking user interaction.
Before you begin, ensure you have met the following requirements:
- Python 3.x installed on your machine
- Pip (Python package installer)
- Git (for cloning the repository)
-
Clone the Repository
git clone https://github.com/CK92149/GPTComputerAutomation.git cd GPTComputerAutomation ```sh
-
Install the Dependencies
pip install -r requirements.txt ```sh
-
Set Up Environment Variables
Create a
.env
file in the root directory and add your OpenAI API key:API_KEY=your_api_key_here ```sh
-
Run the Application
python main.py ```sh
-
Interact with the GUI
- Enter your request in the input field.
- Click "Perform Action" to let GPT analyze the UI elements and determine the next action.
-
Capture UI Elements
The tool captures visible and clickable UI elements on the screen, saving them to a file (
ui_elements.txt
). -
Take a Screenshot
A screenshot is taken and saved as
screenshot.png
, then encoded to base64 for analysis. -
Analyze with GPT
The base64-encoded screenshot and UI elements are sent to the GPT model to determine the most likely next action.
-
Perform the Action
The determined action (mouse click or typing) is performed on your computer.
This project is licensed under the MIT License - see the LICENSE file for details.
If you want to contact me, you can reach me on x.com @Charles12509909