The user interacts with the app by voice. The spoken query is transcribed to text and routed to a Flask server. The Flask server prompts the LLM with the query and sends the response back to the app, which then replies to the user by voice.
[A text query is sent as a GET request; an image query is sent as a POST request.]
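
The two request types could be wired up roughly as follows. This is a minimal sketch, not the project's actual code: the route paths `/query` and `/image`, the parameter names `text` and `image`, and the `ask_llm` helper are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def ask_llm(prompt, image_bytes=None):
    """Hypothetical placeholder; swap in the real LLM call here."""
    kind = "image" if image_bytes is not None else "text"
    return f"[{kind} reply to: {prompt}]"


# Text query: the transcribed speech arrives as a GET query parameter.
@app.route("/query", methods=["GET"])
def text_query():
    text = request.args.get("text", "").strip()
    if not text:
        return jsonify(error="missing 'text' parameter"), 400
    return jsonify(response=ask_llm(text))


# Image query: the image arrives as a multipart file upload in a POST body.
@app.route("/image", methods=["POST"])
def image_query():
    image = request.files.get("image")
    if image is None:
        return jsonify(error="missing 'image' file"), 400
    prompt = request.form.get("text", "Describe this image.")
    return jsonify(response=ask_llm(prompt, image.read()))
```

Under these assumptions, the app would call `GET /query?text=...` for a spoken question and upload the picture to `POST /image` as multipart form data, then read the `response` field of the JSON reply aloud.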
[The Flask server also accepts a URL (via a GET request) and returns a summary of the text on that webpage.]
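
One way the summary step might prepare its input is to strip the fetched page down to visible text before prompting the LLM. The sketch below uses only the Python standard library; the names `page_text` and `build_summary_prompt`, the prompt wording, and the `max_chars` cut-off are assumptions, and fetching the page itself (for example with `urllib.request.urlopen`) is left out so the example runs offline.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_summary_prompt(html, max_chars=4000):
    """Build an LLM prompt from page HTML (hypothetical prompt wording)."""
    # Truncate so a very long page does not overflow the LLM's input limit.
    text = page_text(html)[:max_chars]
    return "Summarize the following webpage text:\n\n" + text
```

The Flask route for this feature would then read the URL from the GET parameters, download the page, and pass `build_summary_prompt(html)` to the LLM. Because each text fragment is joined with a space, text split by inline tags may pick up extra spaces, which is usually harmless for summarization.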