This application demonstrates how to use Twilio and OpenAI's Realtime API for bidirectional voice language translation between a caller and a contact center agent.
The AI Assistant intercepts voice audio from one party, translates it, and speaks the audio in the other party's preferred language. OpenAI's Realtime API offers significantly reduced latency, which makes a natural two-way voice conversation possible.
See here for a video demo of the real-time translation app in action.
Below is a high level architecture diagram of how this application works:
This application uses the following Twilio products in conjunction with OpenAI's Realtime API, orchestrated by this middleware application:
- Voice
- Studio
- Flex
- TaskRouter
Two separate Voice calls are initiated, proxied by this middleware service. The caller is asked to choose their preferred language, then the conversation is queued for the next available agent in Twilio Flex. Once connected to the agent, this middleware intercepts the audio from both parties via Media Streams and forwards to OpenAI Realtime for translation. The translated audio is then forwarded to the other party.
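To make the audio path concrete, below is a minimal sketch of one direction of that bridge using the `ws` package. The `/media-stream` path, port, and model name are illustrative assumptions, not necessarily what this app uses; the Twilio Media Streams `media` frame format and the Realtime `input_audio_buffer.append` event come from the respective public APIs.

```typescript
import WebSocket, { WebSocketServer } from "ws";

// Sketch only: relay one party's Twilio Media Stream into an OpenAI
// Realtime session. Path, port, and model are illustrative.
const wss = new WebSocketServer({ port: 5050, path: "/media-stream" });

wss.on("connection", (twilioWs) => {
  const openaiWs = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  twilioWs.on("message", (raw) => {
    const frame = JSON.parse(raw.toString());
    // Twilio "media" frames carry base64-encoded G.711 u-law audio.
    if (frame.event === "media" && openaiWs.readyState === WebSocket.OPEN) {
      openaiWs.send(
        JSON.stringify({
          type: "input_audio_buffer.append",
          audio: frame.media.payload,
        })
      );
    }
  });

  // Translated audio ("response.audio.delta" events) would be forwarded
  // to the other party's Media Stream here; omitted for brevity.
});
```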
To get up and running, you will need:
- A Twilio Flex Account (create)
- An OpenAI Account (sign up) and API Key
- A second Twilio phone number (instructions)
- Node v20.10.0 or higher (install)
- Ngrok (sign up and download)
There are three required steps to get the app up and running locally for development and testing:
- Open an ngrok tunnel
- Configure middleware app
- Twilio setup
When developing & testing locally, you'll need to open an ngrok tunnel that forwards requests to your local development server. This ngrok tunnel is used for the Twilio Media Streams that forward call audio to/from this application.
To spin up an ngrok tunnel, open a Terminal and run:
ngrok http 5050
Once the tunnel has been initiated, copy the Forwarding URL. It will look something like: `https://[your-ngrok-subdomain].ngrok.app`. You will need this when configuring environment variables for the middleware in the next section.

Note that the `ngrok` command above forwards to a development server running on port 5050, which is the default port configured in this application. If you override the `API_PORT` environment variable covered in the next section, you will need to update the `ngrok` command accordingly.

Keep in mind that each time you run the `ngrok http` command, a new URL will be created, and you'll need to update it everywhere it is referenced below.
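As a small illustration of that coupling (not necessarily this app's exact code), the server's listen port would typically be derived like this:

```typescript
// Illustrative: the server listens on API_PORT, defaulting to 5050,
// so the ngrok tunnel must forward to the same port.
const port = Number(process.env.API_PORT ?? 5050);
// e.g. if API_PORT=8080, run `ngrok http 8080` instead.
```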
- Clone this repository
- Run `npm install` to install dependencies
- Run `cp .env.sample .env` to create your local environment variables file
Once created, open `.env` in your code editor. You are required to set the following environment variables for the app to function properly:
Variable Name | Description | Example Value |
---|---|---|
`NGROK_DOMAIN` | The forwarding URL of your ngrok tunnel initiated above | `[your-ngrok-subdomain].ngrok.app` |
`TWILIO_ACCOUNT_SID` | Your Twilio Account SID, which can be found in the Twilio Console. | `ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` |
`TWILIO_AUTH_TOKEN` | Your Twilio Auth Token, also found in the Twilio Console. | `your_auth_token_here` |
`TWILIO_CALLER_NUMBER` | The additional Twilio phone number you purchased, not connected to Flex. Used for the caller-facing "leg" of the call. | `+18331234567` |
`TWILIO_FLEX_NUMBER` | The phone number automatically purchased when provisioning your Flex account. Used for the agent-facing "leg" of the call. | `+14151234567` |
`TWILIO_FLEX_WORKFLOW_SID` | The TaskRouter Workflow SID, automatically provisioned with your Flex account. Used to enqueue inbound calls with Flex agents. To find it, go to TaskRouter > Workspaces > Flex Task Assignment > Workflows in the Twilio Console. | `WWXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` |
`OPENAI_API_KEY` | Your OpenAI API Key | `your_api_key_here` |
Below are optional environment variables with default values that can be overridden:
Variable Name | Description | Default Value |
---|---|---|
`FORWARD_AUDIO_BEFORE_TRANSLATION` | Set to `true` to forward the original spoken audio between parties. For instance, if the caller is speaking Spanish, the agent hears the original Spanish audio before the translated audio is played. This is useful in production contexts to minimize perceived silences, but is not recommended for development, where one person is simultaneously playing the role of the caller and the agent. | `false` |
`API_PORT` | The port your local server runs on. | `5050` |
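Putting both tables together, a completed `.env` might look like the following, with placeholder values:

```
NGROK_DOMAIN=abc123.ngrok.app
TWILIO_ACCOUNT_SID=ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TWILIO_AUTH_TOKEN=your_auth_token_here
TWILIO_CALLER_NUMBER=+18331234567
TWILIO_FLEX_NUMBER=+14151234567
TWILIO_FLEX_WORKFLOW_SID=WWXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
OPENAI_API_KEY=your_api_key_here

# Optional overrides
FORWARD_AUDIO_BEFORE_TRANSLATION=false
API_PORT=5050
```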
You'll need to import the included Studio Flow in the `inbound_language_studio_flow.json` file into your Twilio account, then configure the caller-facing Twilio phone number to use this Flow. This Studio Flow handles the initial inbound call and presents the caller with a basic IVR to select their preferred language for the conversation with the agent.
In the Twilio Console, go to the Studio Flows page and click Create New Flow. Give your Flow a name, like "Inbound Translation IVR", click Next, then select the option to Import from JSON and click Next.
Copy the contents of `inbound_language_studio_flow.json` and paste it into the textbox. Search for `[your-ngrok-subdomain]` and replace it with your assigned ngrok tunnel subdomain. Click Next to import the Studio Flow, then Publish.
The included Studio Flow will play a prerecorded message for the caller asking them to select their preferred language as either:
- English
- Spanish
- French
- Mandarin
- Hindi
You can update the Studio Flow logic to change the languages you'd like to support. See here for more information on OpenAI's supported language options.
Once your Studio Flow is imported and published, the next step is to point your inbound, caller-facing phone number (`TWILIO_CALLER_NUMBER`) to your Studio Flow. In the Twilio Console, go to Phone Numbers > Manage > Active Numbers and click on the additional phone number you purchased (not the one auto-provisioned by Flex).
In your Phone Number configuration settings, update the first A call comes in dropdown to Studio Flow, select the name of the Flow you created above, and click Save configuration.
The last step is to point the agent-facing phone number (`TWILIO_FLEX_NUMBER`) and the TaskRouter "Flex Task Assignment" Workspace to this middleware app. This is needed to connect the conversation to a contact center agent in Flex.
In the Twilio Console, go to Phone Numbers > Manage > Active Numbers and click on the Flex phone number that was auto-provisioned. In your Phone Number configuration settings, update the first A call comes in dropdown to Webhook, set the URL to `https://[your-ngrok-subdomain].ngrok.app/outbound-call`, ensure HTTP is set to HTTP POST, and click Save configuration.
![Point Agent Phone Number to Middleware](/live-translation-readme-images/flex-voice-number-webhook.png)
Ensure that you replace `[your-ngrok-subdomain]` with your assigned ngrok tunnel subdomain.
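For context, the `/outbound-call` endpoint is expected to answer this webhook with TwiML that opens a Media Stream back to the middleware. Here is a minimal sketch of such a handler using Express and the `twilio` helper library; the framework choice and route internals are illustrative, not this app's actual code:

```typescript
import express from "express";
import twilio from "twilio";

const app = express();

// Illustrative handler: answer the agent-facing call with TwiML that
// opens a bidirectional Media Stream back to this middleware.
app.post("/outbound-call", (req, res) => {
  const response = new twilio.twiml.VoiceResponse();
  const connect = response.connect();
  connect.stream({ url: `wss://${process.env.NGROK_DOMAIN}/media-stream` });
  res.type("text/xml").send(response.toString());
});

app.listen(Number(process.env.API_PORT ?? 5050));
```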
Then, go to TaskRouter > Workspaces > Flex Task Assignment > Settings, and set the Event callback URL to `https://[your-ngrok-subdomain].ngrok.app/reservation-accepted`, again replacing `[your-ngrok-subdomain]` with your assigned ngrok tunnel subdomain.
Finally, under Select events, check the checkbox for Reservation Accepted.
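For reference, here is a minimal sketch of what a `/reservation-accepted` handler could look like. TaskRouter posts form-encoded event callbacks; the handler shape below is illustrative, not this app's actual code:

```typescript
import express from "express";

const app = express();

// Illustrative sketch: TaskRouter sends form-encoded event callbacks.
// When an agent accepts the reservation, the middleware can bridge the
// agent leg of the call to the waiting caller.
app.post(
  "/reservation-accepted",
  express.urlencoded({ extended: false }),
  (req, res) => {
    if (req.body.EventType === "reservation.accepted") {
      // ...connect the agent's call leg to the caller's stream here
    }
    res.sendStatus(200);
  }
);

app.listen(Number(process.env.API_PORT ?? 5050));
```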
Once dependencies are installed, `.env` is set up, and Twilio is configured properly, run the dev server with the following command:
npm run dev
With the development server running, you can now test the translation app. If you want to test the app by yourself, simulating both the agent and the caller, we recommend leaving `FORWARD_AUDIO_BEFORE_TRANSLATION` set to `false` so you're not hearing duplicate audio.
To answer the call as the agent, you'll need to log in to the Flex Agent Desktop. The easiest way to do this is to go to the Flex Overview page and click Log in with Console. Once the Agent Desktop is loaded, be sure that your Agent status is set to Available by toggling the dropdown in the top-right corner of the window. This ensures enqueued tasks will be routed to you.
With your mobile phone, place a call to the `TWILIO_CALLER_NUMBER` (do not dial the `TWILIO_FLEX_NUMBER`). You should hear a prompt to select your desired language and then be connected to Flex. On the Flex Agent Desktop, once a language preference is selected, you should see the call appear as assigned to you. Use Flex to answer the call.
Once connected, you should be able to speak on one end of the call and hear the OpenAI-translated audio delivered to the other end (and vice versa). By default, the Agent's language is set to English. The Realtime API will translate audio from the chosen caller language to English, and the agent's English speech to the chosen caller language.
You can update the instructions used to prompt the OpenAI Realtime API in `src/prompts.ts`. Note that there are two separate connections to the Realtime API, one for the caller and one for the agent. This allows for more precision and flexibility in how the translator behaves for each side of the call. Note that `[CALLER_LANGUAGE]` is dynamically inserted into the prompt based on the caller's language selection during the initial Studio IVR. The default behavior assumes the agent speaks English.
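As an illustration (the app's actual substitution code may differ), that dynamic insertion amounts to a simple string replacement before the instructions are sent to the Realtime session:

```typescript
import { AI_PROMPT_CALLER } from "./prompts";

// Illustrative: substitute the language chosen in the Studio IVR into
// the caller-side prompt before starting the Realtime session.
const callerLanguage = "Spanish"; // e.g. captured by the IVR
const instructions = AI_PROMPT_CALLER.replace(
  /\[CALLER_LANGUAGE\]/g,
  callerLanguage
);
```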
To change the prompt for the caller, update `AI_PROMPT_CALLER`. For the agent, update `AI_PROMPT_AGENT`. The default instructions used for translation are below:
Caller
export const AI_PROMPT_CALLER = `
You are a translation machine. Your sole function is to translate the input text from [CALLER_LANGUAGE] to English.
Do not add, omit, or alter any information.
Do not provide explanations, opinions, or any additional text beyond the direct translation.
You are not aware of any other facts, knowledge, or context beyond translation between [CALLER_LANGUAGE] and English.
Wait until the speaker is done speaking before translating, and translate the entire input text from their turn.
Example interaction:
User: ¿Cuántos días hay en la semana?
Assistant: How many days of the week are there?
User: Tengo dos hermanos y una hermana en mi familia.
Assistant: I have two brothers and one sister in my family.
`;
Agent
export const AI_PROMPT_AGENT = `
You are a translation machine. Your sole function is to translate the input text from English to [CALLER_LANGUAGE].
Do not add, omit, or alter any information.
Do not provide explanations, opinions, or any additional text beyond the direct translation.
You are not aware of any other facts, knowledge, or context beyond translation between English and [CALLER_LANGUAGE].
Wait until the speaker is done speaking before translating, and translate the entire input text from their turn.
Example interaction:
User: How many days of the week are there?
Assistant: ¿Cuántos días hay en la semana?
User: I have two brothers and one sister in my family.
Assistant: Tengo dos hermanos y una hermana en mi familia.
`;
The overall flow of the application is as follows.

- In this diagram, `Voice/Studio` is used colloquially to represent the Twilio Voice and Studio products.
- `BMV` ("Be My Voice") represents this middleware application, and `S2S` represents the speech-to-speech connection to OpenAI's Realtime API.
- The `Agent` represents the human agent who will be connected to the call via Twilio Flex.
sequenceDiagram
actor Customer
participant Voice/Studio
participant BMV
participant S2S
actor Agent
Customer ->> Voice/Studio: Initiates Call
Voice/Studio -->> Customer: <Say>Welcome to Be My Voice.<br>Your call will be transferred to an AI Assistant.<br>What language would you like to use?</Say><br><Gather ...>
Customer -->> Voice/Studio: (Customer selects language)
Voice/Studio ->> +BMV: [HTTP] POST /incoming-call
BMV -->> -Voice/Studio: <Say>...</Say><br><Connect><Stream ... /></Connect>
Voice/Studio -->> Customer: <Say>Please wait while we connect you.</Say>
Customer ->> +BMV: [WS] Initiate Media Stream
activate Customer
activate BMV
activate S2S
BMV ->> +S2S: [WS] Establish Websocket Connection to OpenAI
BMV ->> Voice/Studio: [HTTP] Create Call (to Agent)<br>with TwiML <Connect><Stream ... /></Connect>
Voice/Studio -->> Agent: Incoming Task
Agent ->> BMV: [WS] Establish Websocket Connection
activate Agent
Agent ->>+ BMV: [HTTP] Accept Task
BMV -->>- Agent: Ok 200
note right of BMV: BMV is now intercepting both <br>Agent and Customer Media Stream
note right of BMV: For every Media that comes, stream the data to S2S<br>and stream the response back to Agent/Customer
note right of BMV: For example, it may look something like
loop A conversation loop
Customer ->> BMV: [WS] (Speaks in their language)
BMV ->> S2S: [WS] Stream audio in original language
S2S -->> BMV: [WS] Audio stream in English
BMV ->> Agent: [WS] Stream audio to Agent in English
Agent -->> BMV: [WS] (Replies in English)
BMV ->> S2S: [WS] Stream audio in English language
S2S -->> BMV: [WS] Audio stream in original language
BMV ->> Customer: [WS] Stream audio to Customer in original language
end
note right of BMV: At some point, the conversation is over<br>and the Customer hangs up
BMV -->> Customer: [WS] Close
deactivate Customer
BMV -->> S2S: [WS] Close
deactivate S2S
BMV -->> Agent: [WS] Close
deactivate BMV
deactivate Agent