Notes:
- Added Webhook to Discord server
- Added drop-down list for loading local images
- Vision is now powered by GPT-4o
Current setup:
- LLM: KoboldCPP on localhost
- LLM Vision: OpenAI GPT-4o
- Agentic Processing: Flowise in Docker
- STT: Whisper API (paid)
- TTS: OpenAI TTS (paid)
- SDXL: SD API (paid)
- SD3: SD API (paid)
- Added Stable Diffusion 1 and 3 functionality
- Integrated Gemini Vision (flexible) feature with embedded API widget
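
As a rough illustration of the local LLM entry in the setup list, the sketch below builds and sends a text-generation request to a KoboldCPP server. It is written in Python for brevity, not taken from the app (which is R/Shiny). The URL assumes KoboldCPP's default port 5001 and its KoboldAI-compatible `/api/v1/generate` endpoint; the helper names and default parameter values are hypothetical.

```python
import json
from urllib import request

# Assumption: KoboldCPP is running locally on its default port.
KOBOLD_URL = "http://localhost:5001/api/v1/generate"


def build_generate_payload(prompt: str, max_length: int = 200,
                           temperature: float = 0.7) -> dict:
    """Build the JSON body for a text-generation request."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {"prompt": prompt, "max_length": max_length,
            "temperature": temperature}


def extract_reply(response_body: dict) -> str:
    """Pull the generated text out of a KoboldAI-style response."""
    results = response_body.get("results") or []
    if not results:
        raise ValueError("response contains no results")
    return results[0]["text"].strip()


def generate(prompt: str, url: str = KOBOLD_URL, timeout: float = 60.0) -> str:
    """Send the prompt to the local server and return the reply text."""
    body = json.dumps(build_generate_payload(prompt)).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"},
                          method="POST")
    with request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Keeping payload construction and response parsing separate from the network call means both can be checked without a running server.
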
Project Overview
This is a web-based application built with R and Shiny, a web application framework for R. A demo video is available on YouTube. It is a comprehensive audio processing and generation system that integrates several functions, including:
- Audio Recording and Transcription: Users can record audio, and the application will transcribe the audio into text.
- Chatbot/LLM Integration: The transcribed text is sent to a Large Language Model (LLM) or chatbot, which generates a response.
- Speech Synthesis: The chatbot's response is converted into an audio file using a text-to-speech (TTS) engine.
- Music Generation: The application can generate music based on user input, using a music generation API.
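
The first three items above form a chain: recorded audio is transcribed, the transcript goes to the LLM, and the reply is synthesized back to audio. A minimal, language-neutral sketch of that flow follows, in Python and not the app's actual R code. The three stages are passed in as plain functions, so the names `transcribe`, `chat`, and `synthesize` are placeholders for whichever STT, LLM, and TTS backends are configured.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """Everything produced by one record -> reply round trip."""
    transcript: str
    reply: str
    reply_audio: bytes


def run_voice_turn(audio: bytes,
                   transcribe: Callable[[bytes], str],
                   chat: Callable[[str], str],
                   synthesize: Callable[[str], bytes]) -> VoiceTurn:
    """Run the three stages in order, feeding each output to the next."""
    if not audio:
        raise ValueError("no audio was recorded")
    transcript = transcribe(audio)      # speech-to-text
    reply = chat(transcript)            # LLM / chatbot response
    reply_audio = synthesize(reply)     # text-to-speech
    return VoiceTurn(transcript, reply, reply_audio)


if __name__ == "__main__":
    # Stub backends stand in for the real services.
    turn = run_voice_turn(
        b"\x00\x01",
        transcribe=lambda a: "hello",
        chat=lambda t: t.upper() + "!",
        synthesize=lambda r: r.encode("utf-8"),
    )
    print(turn.transcript, turn.reply, turn.reply_audio)
```

Passing the backends as arguments keeps the flow independent of any one provider, which matches the mix of local and paid services in the setup list.
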
Functionality
The application consists of several modules:
- Audio Recorder: Records user audio input.
- Transcription: Transcribes the recorded audio into text.
- Chatbot/LLM: Sends the transcribed text to a chatbot or LLM, which responds with a textual output.
- Speech Synthesis: Converts the chatbot's response into an audio file.
- Music Generation: Generates music based on user input using a music generation API.
- Suno Song Generation: Generates a song using the music generation API, with options to control instrumental, tags, and title.
- Stable Diffusion 1 and 3: Utilizes Stable Diffusion models for image generation and manipulation.
- Gemini Vision: Integrates an embedded API widget for computer vision capabilities.
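
To make the song-generation options concrete, here is a small sketch of how the instrumental, tags, and title controls could be gathered into one request body. It is in Python and not the app's R code. The field names (`prompt`, `title`, `tags`, `make_instrumental`) and the `"Untitled"` fallback are illustrative assumptions, not the documented schema of any particular music generation API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SongRequest:
    """User-facing song options collected from the UI."""
    prompt: str
    title: str = ""
    tags: List[str] = field(default_factory=list)
    instrumental: bool = False

    def to_payload(self) -> dict:
        """Normalize the options into a JSON-ready dictionary."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        # Drop blank tags and normalize case and whitespace.
        cleaned = [t.strip().lower() for t in self.tags if t.strip()]
        return {
            "prompt": self.prompt.strip(),
            "title": self.title.strip() or "Untitled",  # hypothetical default
            "tags": ", ".join(cleaned),
            "make_instrumental": self.instrumental,     # hypothetical field name
        }
```

A call such as `SongRequest("a song about rain", title="Rainy Day", tags=["lo-fi"]).to_payload()` yields a dictionary that can be serialized and posted to whichever endpoint the music generation service provides.
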
UI Components
The user interface includes:
- Audio Recorder: Buttons for recording and stopping audio input.
- Transcription Box: Displays the transcribed text.
- Chatbot/LLM Output: Displays the chatbot's response.
- Speech Synthesis: Button to convert the chatbot's response into an audio file.
- Suno Song Generation: Inputs for title, song name, and options to generate a song.
- Audio Player: Allows users to play and stop audio files.
- Gemini Vision Widget: Embedded API widget for computer vision capabilities.