/multimodal-gpt

A Screenshot-based Multimodal GPT Assistant

  1. Python sounddevice to record audio until you stop speaking
  2. Whisper API to transcribe the recorded audio
  3. OpenAI TTS for speech output
  4. PyWinCtl and pyautogui to capture a screenshot of a specific window
  5. OpenAI Vision API to process the screenshot and answer your prompt
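
The "record until you stop speaking" step in item 1 can be pictured as a simple silence detector. The sketch below is illustrative only and is not taken from this project's code: the names rms and record_until_silence, the threshold of 0.02, and the three-chunk silence window are all assumptions. It works on plain lists of float samples so it runs without any audio hardware; in practice the chunks would presumably come from a sounddevice input stream.

```python
import math
from typing import Iterable, List, Sequence


def rms(chunk: Sequence[float]) -> float:
    """Root-mean-square amplitude of one chunk of samples (0.0 if empty)."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


def record_until_silence(
    chunks: Iterable[Sequence[float]],
    threshold: float = 0.02,      # hypothetical loudness cutoff
    max_silent_chunks: int = 3,   # hypothetical number of quiet chunks that ends recording
) -> List[float]:
    """Collect samples from the first loud chunk until a run of quiet chunks."""
    recorded: List[float] = []
    heard_speech = False
    silent_run = 0
    for chunk in chunks:
        if rms(chunk) >= threshold:
            # Loud chunk: speech is ongoing, reset the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Quiet chunk after speech has started.
            silent_run += 1
        if heard_speech:
            recorded.extend(chunk)
        if heard_speech and silent_run >= max_silent_chunks:
            break
    return recorded


if __name__ == "__main__":
    quiet, loud = [0.0] * 4, [0.5] * 4
    stream = [quiet, loud, loud, quiet, quiet, quiet, loud]
    print(len(record_until_silence(stream)))
```

Leading silence is dropped and the trailing quiet chunks are kept, so the recording ends shortly after the speaker pauses. A real implementation would likely tune the threshold to the microphone and may use a different detection method.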

Installation

python -m venv venv
. venv/bin/activate
pip install -r requirements.txt

On Windows, activate the environment with venv\Scripts\activate instead.

Run

python main.py

Configuration

All project-wide settings are in settings.py.