Add vision support
ErikBjare opened this issue · 0 comments
ErikBjare commented
The OpenAI API now has vision support in beta, and we could run LLaVA locally.
Might be a lot of work, or might be super easy.
The question is: what would it be useful for?
- #51: Xvfb to understand display/output and make an E2E desktop agent
- #52: Screenshot with browser tool
- Can be used to take screenshots of developed webapps for visually-aided autodebugging
- Have it review plot outputs for correctness and inspect results
- Could be useful for data science, but reading a good plain text output might still be superior
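For the screenshot use cases above, a minimal sketch of what the integration might look like: building a chat message that inlines an image as a base64 data URL, in the content format used by OpenAI's vision beta. The function name `make_vision_message` and the PNG assumption are illustrative, not part of any existing code.

```python
import base64
from pathlib import Path


def make_vision_message(prompt: str, image_path: str) -> dict:
    """Build a user message with an inline base64-encoded image,
    following the multi-part content format of OpenAI's vision beta.
    Assumes the image is a PNG; a real implementation would detect
    the media type."""
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{img_b64}"},
            },
        ],
    }
```

The resulting dict could be passed in the `messages` list of a chat completion request against a vision-capable model; a local LLaVA backend would need its own adapter but could reuse the same screenshot-to-message step.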