This repo contains the code for an example text-to-speech server that uses the OpenAI API to fetch customised answers to questions based on a given context.
Note: the video includes some "silent" stretches that simulate a user who stops speaking to the bot for a while. Please do not skip them: after a period of silence, a custom message appears to catch your attention.
Bupa.-.Showcase.31.May.2023.mp4
Videos of earlier versions are available under `assets/`.
- Python 3.9
- Clone the repo
- Install the dependencies with `pip install -r requirements.txt`
- Create a `tts-server/.env` file with the following variables:
```
# options - openai or local
OPENAI_API_KEY=<your-openai-api-key>
OPENAI_MODEL=gpt-3.5-turbo
# options - vits-emo, tortoise or default
TTS_MODE=vits-emo
ROBOT_FILTER=true
COQUI_AI_BASE_URL=https://app.coqui.ai/api/v2/samples
COQUI_AI_API_KEY=<your-coqui-ai-api-key>
COQUI_AI_VOICE_ID=d2bd7ccb-1b65-4005-9578-32c4e02d8ddf
CONVERSATION_HISTORY=true
```
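For reference, this is the `KEY=VALUE` format that tools such as python-dotenv read (how the server actually loads the file is not shown here). A minimal stdlib-only sketch of parsing it:

```python
def load_env_file(path):
    """Parse a .env-style file: KEY=VALUE lines, skipping
    comments and blank lines. Illustrative sketch only."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```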
- To use the local gpt4all model, first download it and place it under `tts-server/assets/bin`
- To use the model (vits-emo) that was trained for this project, please contact me so that I can provide you with the URLs.
- Run the server with `python tts-server/main.py`
- Access the server running at `http://localhost:8080/`, configure the Bupa bot and submit a question. Alternatively, send a POST request to `http://localhost:8080/ask` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "What is human life expectancy in the United States?"
}
```
- Also, to get the speech representation of a text, send a POST request to `http://localhost:8080/audio` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "The human life expectancy in the United States fortunately is 78 years."
}
```
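The two endpoints above can also be called from Python. A minimal stdlib-only sketch (endpoint paths and JSON fields are taken from this README; the port assumes the non-Docker setup):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # use :5001 when running via Docker

def build_payload(text, mood="happy", persona="yoda"):
    """JSON body shared by the /ask and /audio endpoints."""
    return {"mood": mood, "persona": persona, "text": text}

def post_json(endpoint, payload):
    """POST a JSON payload and return the raw response bytes."""
    req = urllib.request.Request(
        f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Ask a question:
#   answer = post_json("/ask", build_payload("What is human life expectancy in the United States?"))
# Fetch the spoken version of a text and save it (assuming the server returns audio bytes):
#   with open("answer.wav", "wb") as fh:
#       fh.write(post_json("/audio", build_payload("The human life expectancy in the United States fortunately is 78 years.")))
```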
- Build the Docker image with `docker build -t bupa-bot .`
- Run the Docker container with `docker run -p 5001:8080 bupa-bot`
- Access the server running at `http://localhost:5001/`, configure the Bupa bot and submit a question. Alternatively, send a POST request to `http://localhost:5001/ask` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "What is human life expectancy in the United States?"
}
```
- Also, to get the speech representation of a text, send a POST request to `http://localhost:5001/audio` with the following JSON body:

```json
{
  "mood": "happy",
  "persona": "yoda",
  "text": "The human life expectancy in the United States fortunately is 78 years."
}
```
A different set of models was created to generate speech with emotion. In the end, we found that the best results were achieved by fine-tuning an existing VITS model and adding multi-speaker functionality where each speaker is an emotion.
The notebook used to train this model is available under `notebooks/`, as are the notebooks for the other models that were tested.
These are the results for our TTS model after 1,017,756 steps:
The filters were designed by a post-production sound designer and applied using a set of Python libraries (kudos to the Spotify Pedalboard library).
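The project's actual filter chain is built with Pedalboard and is not reproduced here; as an illustration of the idea, a pure-NumPy sketch of one classic robot-voice building block, ring modulation with a sine carrier:

```python
import numpy as np

def robot_filter(audio, sample_rate, carrier_hz=50.0):
    """Ring-modulate a mono signal with a low-frequency sine carrier,
    a classic 'robot voice' effect. Illustrative sketch only; the
    repo applies a designer-tuned chain via Spotify's Pedalboard."""
    t = np.arange(len(audio)) / sample_rate
    carrier = np.sin(2.0 * np.pi * carrier_hz * t)
    return audio * carrier
```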
- On the existing architecture, create a robot filter to apply to the final audio
- Create or adapt datasets with emotion for training the TTS models
- Apply the robot filter to the emotion dataset
- Train different models for different moods and personas (notebooks already available to train new models using GlowTTS and VITS)
- Add more moods and personas
- Use our own GPT model instead of the OpenAI API
This project was inspired by the following projects:
- OpenAI API
- Coqui TTS
- Spotify Pedalboard