a-real-ai/pywinassistant

Is it only used for Windows?


Is it only usable on Windows, or could we also run it on Linux?

Currently, it only supports Windows, as it is the most common OS on the market.

I have the early stages of this architecture implemented for Linux, but it is not even half finished. The ChatGPT LLMs seem to have a better understanding of the spatial layout of apps built for Windows and tend to assume apps on Linux use the same spatial layout. For those reasons I'm implementing it on Ubuntu first, rather than on other OSes with potentially different UI/UX designs.

The same visualization-of-thought architecture can be applied to other OSes like Linux, macOS, and Android. As I'm a single person working on it, my current goal is to first make the best assistant for Windows; then I'll move on fully to Linux and macOS.

@henyckma By the way, did you try to incorporate information from tools such as Microsoft's accessibility features? They offer UI Automation, and there is Narrator. With the Inspect tool you get a hierarchical overview of the whole UI and its buttons/texts/elements: https://learn.microsoft.com/en-gb/windows/win32/winauto/inspect-objects?redirectedfrom=MSDN Maybe you could provide the screen-reader descriptions as context for the LLM, so it has a description of what a button does or what different UI elements can do. For example, in Word, when you want to toggle Bold, the screen reader would say something like "Application Word, Document xxx, Bold Button, on toggle makes selected text bold". I think this would be valuable information for an agent.
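
For illustration, here is a minimal sketch of that idea, assuming pywinauto's UIA backend (not necessarily what PyWinAssistant uses internally): walk the focused window's control tree and collect name/control-type pairs that could be appended to the LLM prompt as context.

```python
# Sketch: dump UI Automation element names/types as LLM context.
# Assumes pywinauto with the UIA backend; PyWinAssistant's actual internals may differ.
from pywinauto import Desktop

def describe_active_window(max_elements=50):
    """Return short text descriptions of UI elements in the focused window."""
    # active_only is a criterion passed to pywinauto's element finder
    window = Desktop(backend="uia").window(active_only=True)
    lines = []
    for element in window.descendants()[:max_elements]:
        info = element.element_info
        if info.name:  # skip unnamed structural elements
            lines.append(f"{info.control_type}: {info.name}")
    return "\n".join(lines)

if __name__ == "__main__":
    # The resulting text (e.g. "Button: Bold") could be fed to the LLM as context.
    print(describe_active_window())
```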

Great thought. I had already wondered whether it was already using something like this along with the vision prompts.

@henyckma What would you use on macOS as a Win32 API equivalent? I want to implement something similar for my own use case on macOS.

@Razorbob Thank you! Since some systems lack inspect.exe, and installing the SDK to get it requires administrator privileges (as does using it), I'm avoiding administrator privileges at all costs. By giving the LLM the app name and some elements of the application, it can guess/hallucinate the state of the application, i.e. whether it is on the home screen or somewhere in particular (for example in Spotify, it guesses whether it is on the home screen or inside an album).
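
As a rough illustration of that guessing step (the prompt wording below is hypothetical, not the project's actual prompt), one could combine the window title with a handful of element names and let the model infer the state:

```python
# Sketch of the "guess the app state" idea; prompt text is a made-up example,
# not PyWinAssistant's actual prompt.
def build_state_prompt(app_title, element_names, goal):
    elements = ", ".join(element_names[:15])  # a small sample is usually enough
    return (
        f"Application window: '{app_title}'.\n"
        f"Some visible UI elements: {elements}.\n"
        f"User goal: {goal}.\n"
        "Infer which screen the application is currently on "
        "(e.g. home screen, album view, settings) and propose the next UI action."
    )

prompt = build_state_prompt(
    "Spotify",
    ["Home", "Search", "Your Library", "Liked Songs", "Play"],
    "play the user's Liked Songs playlist",
)
print(prompt)
```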

I tested inspect.exe further, but it didn't improve the spatial reasoning.

@OlypsisAli I'm temporarily without a macOS machine to develop on, but I believe the equivalent is PyObjC.
https://gist.github.com/amomchilov/096ce5ceb9f4fca942ae0dd37066bc11
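
A minimal sketch of that PyObjC route, based on the approach in the linked gist (an assumption, untested here); it reads the focused application's focused element via the macOS Accessibility API:

```python
# Sketch: macOS rough equivalent using PyObjC's Accessibility (AX) bridge.
# Assumes the pyobjc ApplicationServices framework wrapper is installed and that
# the running process has Accessibility permission (System Settings > Privacy &
# Security > Accessibility); both are assumptions, not verified here.
from ApplicationServices import (
    AXUIElementCreateSystemWide,
    AXUIElementCopyAttributeValue,
    kAXFocusedApplicationAttribute,
    kAXFocusedUIElementAttribute,
    kAXRoleAttribute,
    kAXTitleAttribute,
)

def focused_element_description():
    """Return (role, title) of the UI element that currently has keyboard focus."""
    system = AXUIElementCreateSystemWide()
    err, app = AXUIElementCopyAttributeValue(system, kAXFocusedApplicationAttribute, None)
    if err or app is None:
        return None
    err, element = AXUIElementCopyAttributeValue(app, kAXFocusedUIElementAttribute, None)
    if err or element is None:
        return None
    _, role = AXUIElementCopyAttributeValue(element, kAXRoleAttribute, None)
    _, title = AXUIElementCopyAttributeValue(element, kAXTitleAttribute, None)
    return role, title

print(focused_element_description())
```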

I assume that OpenAI is working on a LAM for macOS, but I don't believe they're going to call it a LAM.

Their GPT-4o (Omni) model lacks a LAM (Large Action Model), just as PyWinAssistant lacks an Omni model.

I'm integrating the Omni model into the PyWinAssistant chat to create "a real Jarvis".

The Analyze button was intended to provide real-time analysis and interaction with the user, just like their recent demos of the Omni model.

I'm also integrating real-time step analysis / a step modifier to achieve the best real-time LAM.