This repo contains the code for an example text-to-speech server that uses the OpenAI API to fetch customised answers to questions based on a given context.
Note: the video includes some "silent" stretches that simulate a user who stops speaking to the bot for a while. Please do not skip them: after a period of silence, a custom message appears to catch your attention.
Bupa.-.Showcase.31.May.2023.mp4
Videos of earlier versions are available under `assets/`.
- Python 3.9
- Clone the repo
- Install the dependencies with `pip install -r requirements.txt`
- Create a `tts-server/.env` file with the following variables:
```
# options - openai or local
OPENAI_API_KEY=<your-openai-api-key>
OPENAI_MODEL=gpt-3.5-turbo
# options - vits-emo, tortoise or default
TTS_MODE=vits-emo
ROBOT_FILTER=true
COQUI_AI_BASE_URL=https://app.coqui.ai/api/v2/samples
COQUI_AI_API_KEY=<your-coqui-ai-api-key>
COQUI_AI_VOICE_ID=d2bd7ccb-1b65-4005-9578-32c4e02d8ddf
CONVERSATION_HISTORY=true
```
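For reference, this is the `KEY=VALUE` format that tools such as python-dotenv read (how the server actually loads the file is not shown here). A minimal stdlib-only sketch of parsing it:

```python
def load_env_file(path):
    """Parse a .env-style file: KEY=VALUE lines, skipping
    comments and blank lines. Illustrative sketch only."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```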
- To use the local gpt4all model, first download it and place it under `tts-server/assets/bin`
- To use the model (vits-emo) that was trained for this project, please contact me so that I can provide you with the URLs.
- Run the server with `python tts-server/main.py`
- Access the server running at `http://localhost:8080/`, configure the Bupa bot and submit a question. Alternatively, send a POST request to `http://localhost:8080/ask` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "What is human life expectancy in the United States?"
}
```
- Also, to get the speech representation of a text, send a POST request to `http://localhost:8080/audio` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "The human life expectancy in the United States fortunately is 78 years."
}
```
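The two endpoints above can also be called from Python. A minimal stdlib-only sketch (endpoint paths and JSON fields are taken from this README; the port assumes the non-Docker setup):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # use :5001 when running via Docker

def build_payload(text, mood="happy", persona="yoda"):
    """JSON body shared by the /ask and /audio endpoints."""
    return {"mood": mood, "persona": persona, "text": text}

def post_json(endpoint, payload):
    """POST a JSON payload and return the raw response bytes."""
    req = urllib.request.Request(
        f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Ask a question:
#   answer = post_json("/ask", build_payload("What is human life expectancy in the United States?"))
# Fetch the spoken version of a text and save it (assuming the server returns audio bytes):
#   with open("answer.wav", "wb") as fh:
#       fh.write(post_json("/audio", build_payload("The human life expectancy in the United States fortunately is 78 years.")))
```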
- Build the Docker image with `docker build -t bupa-bot .`
- Run the Docker container with `docker run -p 5001:8080 bupa-bot`
- Access the server running at `http://localhost:5001/`, configure the Bupa bot and submit a question. Alternatively, send a POST request to `http://localhost:5001/ask` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "What is human life expectancy in the United States?"
}
```
- Also, to get the speech representation of a text, send a POST request to `http://localhost:5001/audio` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "The human life expectancy in the United States fortunately is 78 years."
}
```
A different set of models was created to generate speech with emotion. In the end, we found that the best results were achieved by fine-tuning an existing VITS model and adding multi-speaker functionality where each speaker is an emotion.
The notebook used to train this model is available under `notebooks/`, as are the notebooks for the other models that were tested.
These are the results for our TTS model after 1,017,756 steps:
The filters were designed by a post-production sound designer and applied using a set of Python libraries (kudos to the Spotify Pedalboard library).
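The project's actual filter chain is built with Pedalboard and is not reproduced here; as an illustration of the idea, a pure-NumPy sketch of one classic robot-voice building block, ring modulation with a sine carrier:

```python
import numpy as np

def robot_filter(audio, sample_rate, carrier_hz=50.0):
    """Ring-modulate a mono signal with a low-frequency sine carrier,
    a classic 'robot voice' effect. Illustrative sketch only; the
    repo applies a designer-tuned chain via Spotify's Pedalboard."""
    t = np.arange(len(audio)) / sample_rate
    carrier = np.sin(2.0 * np.pi * carrier_hz * t)
    return audio * carrier
```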
- On the existing architecture, create a robot filter to apply to the final audio
- Create or adapt datasets with emotion for training the TTS models
- Apply the robot filter to the emotion dataset
- Train different models for different moods and personas (notebooks already available to train new models using GlowTTS and VITS)
- Add more moods and personas
- Use our own GPT model instead of the OpenAI API
This project was inspired by the following projects:
- OpenAI API
- Coqui TTS
- Spotify Pedalboard