This application demonstrates how to use Twilio and OpenAI's Realtime API for bidirectional voice language translation between a caller and a contact center agent.
The AI Assistant intercepts voice audio from one party, translates it, and speaks the audio in the other party's preferred language. OpenAI's Realtime API offers significantly reduced latency, which makes a natural two-way voice conversation possible.
See here for a video demo of the real-time translation app in action.
Below is a high level architecture diagram of how this application works:
This application uses the following Twilio products in conjunction with OpenAI's Realtime API, orchestrated by this middleware application:
- Voice
- Studio
- Flex
- TaskRouter
Two separate Voice calls are initiated, proxied by this middleware service. The caller is asked to choose their preferred language, then the conversation is queued for the next available agent in Twilio Flex. Once connected to the agent, this middleware intercepts the audio from both parties via Media Streams and forwards to OpenAI Realtime for translation. The translated audio is then forwarded to the other party.
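To make the audio path concrete, below is a minimal sketch of one direction of that bridge using the `ws` package. The `/media-stream` path, port, and model name are illustrative assumptions, not necessarily what this app uses; the Twilio Media Streams `media` frame format and the Realtime `input_audio_buffer.append` event come from the respective public APIs.

```typescript
import WebSocket, { WebSocketServer } from "ws";

// Sketch only: relay one party's Twilio Media Stream into an OpenAI
// Realtime session. Path, port, and model are illustrative.
const wss = new WebSocketServer({ port: 5050, path: "/media-stream" });

wss.on("connection", (twilioWs) => {
  const openaiWs = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  twilioWs.on("message", (raw) => {
    const frame = JSON.parse(raw.toString());
    // Twilio "media" frames carry base64-encoded G.711 u-law audio.
    if (frame.event === "media" && openaiWs.readyState === WebSocket.OPEN) {
      openaiWs.send(
        JSON.stringify({
          type: "input_audio_buffer.append",
          audio: frame.media.payload,
        })
      );
    }
  });

  // Translated audio ("response.audio.delta" events) would be forwarded
  // to the other party's Media Stream here; omitted for brevity.
});
```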
To get up and running, you will need:
- A Twilio Flex Account (create)
- An OpenAI Account (sign up) and API Key
- A second Twilio phone number (instructions)
- Node v20.10.0 or higher (install)
- Ngrok (sign up and download)
There are three required steps to get the app up and running locally for development and testing:
- Open an ngrok tunnel
- Configure middleware app
- Twilio setup
When developing & testing locally, you'll need to open an ngrok tunnel that forwards requests to your local development server. This ngrok tunnel is used for the Twilio Media Streams that forward call audio to/from this application.
To spin up an ngrok tunnel, open a Terminal and run:
ngrok http 5050
Once the tunnel has been initiated, copy the Forwarding URL. It will look something like: `https://[your-ngrok-subdomain].ngrok.app`. You will need this when configuring environment variables for the middleware in the next section.

Note that the `ngrok` command above forwards to a development server running on port 5050, which is the default port configured in this application. If you override the `API_PORT` environment variable covered in the next section, you will need to update the `ngrok` command accordingly.

Keep in mind that each time you run the `ngrok http` command, a new URL will be created, and you'll need to update it everywhere it is referenced below.
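As a small illustration of that coupling (not necessarily this app's exact code), the server's listen port would typically be derived like this:

```typescript
// Illustrative: the server listens on API_PORT, defaulting to 5050,
// so the ngrok tunnel must forward to the same port.
const port = Number(process.env.API_PORT ?? 5050);
// e.g. if API_PORT=8080, run `ngrok http 8080` instead.
```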
- Clone this repository
- Run `npm install` to install dependencies
- Run `cp .env.sample .env` to create your local environment variables file
Once created, open `.env` in your code editor. You are required to set the following environment variables for the app to function properly:
Variable Name | Description | Example Value |
---|---|---|
`NGROK_DOMAIN` | The forwarding URL of your ngrok tunnel initiated above | `[your-ngrok-subdomain].ngrok.app` |
`TWILIO_ACCOUNT_SID` | Your Twilio Account SID, which can be found in the Twilio Console. | `ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` |
`TWILIO_AUTH_TOKEN` | Your Twilio Auth Token, also found in the Twilio Console. | `your_auth_token_here` |
`TWILIO_CALLER_NUMBER` | The additional Twilio phone number you purchased, not connected to Flex. Used for the caller-facing "leg" of the call. | `+18331234567` |
`TWILIO_FLEX_NUMBER` | The phone number automatically purchased when provisioning your Flex account. Used for the agent-facing "leg" of the call. | `+14151234567` |
`TWILIO_FLEX_WORKFLOW_SID` | The TaskRouter Workflow SID, automatically provisioned with your Flex account. Used to enqueue inbound calls with Flex agents. To find it, go to TaskRouter > Workspaces > Flex Task Assignment > Workflows in the Twilio Console. | `WWXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` |
`OPENAI_API_KEY` | Your OpenAI API Key | `your_api_key_here` |
Below are optional environment variables with default values that can be overridden:
Variable Name | Description | Default Value |
---|---|---|
`FORWARD_AUDIO_BEFORE_TRANSLATION` | Set to `true` to forward the original spoken audio between parties. For instance, if the caller is speaking Spanish, the agent hears the original Spanish audio before the translated audio is played. This is useful in production contexts to minimize perceived silences, but is not recommended for development, where one person is simultaneously playing the role of the caller and the agent. | `false` |
`API_PORT` | The port your local server runs on. | `5050` |
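Putting both tables together, a completed `.env` might look like the following, with placeholder values:

```
NGROK_DOMAIN=abc123.ngrok.app
TWILIO_ACCOUNT_SID=ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TWILIO_AUTH_TOKEN=your_auth_token_here
TWILIO_CALLER_NUMBER=+18331234567
TWILIO_FLEX_NUMBER=+14151234567
TWILIO_FLEX_WORKFLOW_SID=WWXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
OPENAI_API_KEY=your_api_key_here

# Optional overrides
FORWARD_AUDIO_BEFORE_TRANSLATION=false
API_PORT=5050
```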
You'll need to import the included Studio Flow in the `inbound_language_studio_flow.json` file into your Twilio account, then configure the caller-facing Twilio phone number to use this Flow. This Studio Flow handles the initial inbound call and presents the caller with a basic IVR to select their preferred language for the conversation with the agent.
In the Twilio Console, go to the Studio Flows page and click Create New Flow. Give your Flow a name, like "Inbound Translation IVR", click Next, then select the option to Import from JSON and click Next.
Copy the contents of `inbound_language_studio_flow.json` and paste it into the textbox. Search for `[your-ngrok-subdomain]` and replace it with your assigned ngrok tunnel subdomain. Click Next to import the Studio Flow, then Publish.
The included Studio Flow will play a prerecorded message for the caller asking them to select their preferred language as either:
- English
- Spanish
- French
- Mandarin
- Hindi
You can update the Studio Flow logic to change the languages you'd like to support. See here for more information on OpenAI's supported language options.
Once your Studio Flow is imported and published, the next step is to point your inbound, caller-facing phone number (`TWILIO_CALLER_NUMBER`) to your Studio Flow. In the Twilio Console, go to Phone Numbers > Manage > Active Numbers and click on the additional phone number you purchased (not the one auto-provisioned by Flex).
In your Phone Number configuration settings, update the first A call comes in dropdown to Studio Flow, select the name of the Flow you created above, and click Save configuration.
The last step is to point the agent-facing phone number (`TWILIO_FLEX_NUMBER`) and the TaskRouter "Flex Task Assignment" Workspace to this middleware app. This is needed to connect the conversation to a contact center agent in Flex.
In the Twilio Console, go to Phone Numbers > Manage > Active Numbers and click on the Flex phone number that was auto-provisioned. In your Phone Number configuration settings, update the first A call comes in dropdown to Webhook, set the URL to `https://[your-ngrok-subdomain].ngrok.app/outbound-call`, ensure HTTP is set to HTTP POST, and click Save configuration.
![Point Agent Phone Number to Middleware](/live-translation-readme-images/flex-voice-number-webhook.png)
Ensure that you replace `[your-ngrok-subdomain]` with your assigned ngrok tunnel subdomain.
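For context, the `/outbound-call` endpoint is expected to answer this webhook with TwiML that opens a Media Stream back to the middleware. Here is a minimal sketch of such a handler using Express and the `twilio` helper library; the framework choice and route internals are illustrative, not this app's actual code:

```typescript
import express from "express";
import twilio from "twilio";

const app = express();

// Illustrative handler: answer the agent-facing call with TwiML that
// opens a bidirectional Media Stream back to this middleware.
app.post("/outbound-call", (req, res) => {
  const response = new twilio.twiml.VoiceResponse();
  const connect = response.connect();
  connect.stream({ url: `wss://${process.env.NGROK_DOMAIN}/media-stream` });
  res.type("text/xml").send(response.toString());
});

app.listen(Number(process.env.API_PORT ?? 5050));
```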
Then, go to TaskRouter > Workspaces > Flex Task Assignment > Settings, and set the Event callback URL to `https://[your-ngrok-subdomain].ngrok.app/reservation-accepted`, again replacing `[your-ngrok-subdomain]` with your assigned ngrok tunnel subdomain.
Finally, under Select events, check the checkbox for Reservation Accepted.
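For reference, here is a minimal sketch of what a `/reservation-accepted` handler could look like. TaskRouter posts form-encoded event callbacks; the handler shape below is illustrative, not this app's actual code:

```typescript
import express from "express";

const app = express();

// Illustrative sketch: TaskRouter sends form-encoded event callbacks.
// When an agent accepts the reservation, the middleware can bridge the
// agent leg of the call to the waiting caller.
app.post(
  "/reservation-accepted",
  express.urlencoded({ extended: false }),
  (req, res) => {
    if (req.body.EventType === "reservation.accepted") {
      // ...connect the agent's call leg to the caller's stream here
    }
    res.sendStatus(200);
  }
);

app.listen(Number(process.env.API_PORT ?? 5050));
```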
Once dependencies are installed, `.env` is set up, and Twilio is configured properly, run the dev server with the following command:
npm run dev
With the development server running, you can now test the translation app. If you want to test the app by yourself, simulating both the agent and the caller, we recommend leaving `FORWARD_AUDIO_BEFORE_TRANSLATION` set to `false` so you're not hearing duplicate audio.
To answer the call as the agent, you'll need to log in to the Flex Agent Desktop. The easiest way to do this is to go to the Flex Overview page and click Log in with Console. Once the Agent Desktop is loaded, be sure that your Agent status is set to Available by toggling the dropdown in the top-right corner of the window. This ensures enqueued tasks will be routed to you.
With your mobile phone, place a call to the `TWILIO_CALLER_NUMBER` (do not dial the `TWILIO_FLEX_NUMBER`). You should hear a prompt to select your desired language and then be connected to Flex. On the Flex Agent Desktop, once a language preference is selected, you should see the call appear as assigned to you. Use Flex to answer the call.
Once connected, you should be able to speak on one end of the call and hear the OpenAI-translated audio delivered to the other end (and vice versa). By default, the Agent's language is set to English. The Realtime API will translate audio from the chosen caller language to English, and the agent's English speech to the chosen caller language.
You can update the instructions used to prompt the OpenAI Realtime API in `src/prompts.ts`. Note that there are two separate connections to the Realtime API, one for the caller and one for the agent. This allows for more precision and flexibility in how the translator behaves for each side of the call. Note that `[CALLER_LANGUAGE]` is dynamically inserted into the prompt based on the caller's language selection during the initial Studio IVR. The default behavior assumes the agent speaks English.
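As an illustration (the app's actual substitution code may differ), that dynamic insertion amounts to a simple string replacement before the instructions are sent to the Realtime session:

```typescript
import { AI_PROMPT_CALLER } from "./prompts";

// Illustrative: substitute the language chosen in the Studio IVR into
// the caller-side prompt before starting the Realtime session.
const callerLanguage = "Spanish"; // e.g. captured by the IVR
const instructions = AI_PROMPT_CALLER.replace(
  /\[CALLER_LANGUAGE\]/g,
  callerLanguage
);
```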
To change the prompt for the caller, update `AI_PROMPT_CALLER`. For the agent, update `AI_PROMPT_AGENT`. The default instructions used for translation are below:
Caller
export const AI_PROMPT_CALLER = `
You are a translation machine. Your sole function is to translate the input text from [CALLER_LANGUAGE] to English.
Do not add, omit, or alter any information.
Do not provide explanations, opinions, or any additional text beyond the direct translation.
You are not aware of any other facts, knowledge, or context beyond translation between [CALLER_LANGUAGE] and English.
Wait until the speaker is done speaking before translating, and translate the entire input text from their turn.
Example interaction:
User: ¿Cuántos días hay en la semana?
Assistant: How many days of the week are there?
User: Tengo dos hermanos y una hermana en mi familia.
Assistant: I have two brothers and one sister in my family.
`;
Agent
export const AI_PROMPT_AGENT = `
You are a translation machine. Your sole function is to translate the input text from English to [CALLER_LANGUAGE].
Do not add, omit, or alter any information.
Do not provide explanations, opinions, or any additional text beyond the direct translation.
You are not aware of any other facts, knowledge, or context beyond translation between English and [CALLER_LANGUAGE].
Wait until the speaker is done speaking before translating, and translate the entire input text from their turn.
Example interaction:
User: How many days of the week are there?
Assistant: ¿Cuántos días hay en la semana?
User: I have two brothers and one sister in my family.
Assistant: Tengo dos hermanos y una hermana en mi familia.
`;
The overall flow of the application is as follows.

- In this diagram, `Voice/Studio` is used colloquially to represent the Twilio Voice and Studio products.
- `BMV` ("Be My Voice") represents this middleware application, and `S2S` represents the speech-to-speech connection to OpenAI's Realtime API.
- The `Agent` represents the human agent who will be connected to the call via Twilio Flex.
sequenceDiagram
actor Customer
participant Voice/Studio
participant BMV
participant S2S
actor Agent
Customer ->> Voice/Studio: Initiates Call
Voice/Studio -->> Customer: <Say>Welcome to Be My Voice.<br>Your call will be transferred to an AI Assistant.<br>What language would you like to use?</Say><br><Gather ...>
Customer -->> Voice/Studio: (Customer selects language)
Voice/Studio ->> +BMV: [HTTP] POST /incoming-call
BMV -->> -Voice/Studio: <Say>...</Say><br><Connect><Stream ... /></Connect>
Voice/Studio -->> Customer: <Say>Please wait while we connect you.</Say>
Customer ->> +BMV: [WS] Initiate Media Stream
activate Customer
activate BMV
activate S2S
BMV ->> +S2S: [WS] Establish Websocket Connection to OpenAI
BMV ->> Voice/Studio: [HTTP] Create Call (to Agent)<br>with TwiML <Connect><Stream ... /></Connect>
Voice/Studio -->> Agent: Incoming Task
Agent ->> BMV: [WS] Establish Websocket Connection
activate Agent
Agent ->>+ BMV: [HTTP] Accept Task
BMV -->>- Agent: Ok 200
note right of BMV: BMV is now intercepting both <br>Agent and Customer Media Stream
note right of BMV: For every Media that comes, stream the data to S2S<br>and stream the response back to Agent/Customer
note right of BMV: For example, it may look something like
loop A conversation loop
Customer ->> BMV: [WS] (Speaks in their language)
BMV ->> S2S: [WS] Stream audio in original language
S2S -->> BMV: [WS] Audio stream in English
BMV ->> Agent: [WS] Stream audio to Agent in English
Agent -->> BMV: [WS] (Replies in English)
BMV ->> S2S: [WS] Stream audio in English language
S2S -->> BMV: [WS] Audio stream in original language
BMV ->> Customer: [WS] Stream audio to Customer in original language
end
note right of BMV: At some point, the conversation is over<br>and the Customer hangs up
BMV -->> Customer: [WS] Close
deactivate Customer
BMV -->> S2S: [WS] Close
deactivate S2S
BMV -->> Agent: [WS] Close
deactivate BMV
deactivate Agent