openai/openai-realtime-console

[BUG] Audio clipping at the end of speech that contains values from a function call lookup

Opened this issue · 2 comments

agapic commented

When speech involves a tool / function call, the audio can be clipped at the end if the last portion of the transcript contains a value from the function lookup. When the speech is generated by the AI directly, I have never seen clipping occur. It isn't specific to the get_weather function: it happens with any function, and I've reproduced it in many ways with different tools.

To reach a repro state more easily, change the prompt to guide the AI to put function call values at the end of its speech. The code is otherwise what is in the main branch now. I don't think this has anything to do with the client code: I inspected the websocket messages, and the audio for the end of the sentence is never sent.
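One way to check this from captured websocket traffic is to total the audio actually received and compare it with the final transcript. The following is a minimal sketch, not the console's code. It assumes the server events are named `response.audio.delta` (base64 audio in `delta`) and `response.audio_transcript.done` (text in `transcript`), and that output audio is 16-bit PCM at 24 kHz. The `RealtimeEvent` type and `summarizeResponse` helper are hypothetical names introduced for illustration.

```typescript
// Hypothetical shape for the subset of server events inspected here.
interface RealtimeEvent {
  type: string;
  delta?: string; // base64 audio chunk (assumed for "response.audio.delta")
  transcript?: string; // final text (assumed for "response.audio_transcript.done")
}

const SAMPLE_RATE_HZ = 24000; // assumption: pcm16 output at 24 kHz
const BYTES_PER_SAMPLE = 2; // 16-bit samples

// Decoded byte length of a base64 string, computed without decoding it.
function base64ByteLength(b64: string): number {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return (b64.length * 3) / 4 - padding;
}

// Totals the audio received for one response and picks up its final transcript.
function summarizeResponse(events: RealtimeEvent[]): {
  audioSeconds: number;
  transcript: string;
} {
  let bytes = 0;
  let transcript = "";
  for (const ev of events) {
    if (ev.type === "response.audio.delta" && ev.delta) {
      bytes += base64ByteLength(ev.delta);
    } else if (ev.type === "response.audio_transcript.done" && ev.transcript) {
      transcript = ev.transcript;
    }
  }
  return {
    audioSeconds: bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE),
    transcript,
  };
}
```

If the reported duration is clearly too short for the transcript (for example, a sentence that should take about six seconds arrives with four seconds of audio), the missing tail was never sent, which would point away from client-side playback.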

Here's the full prompt:

`System settings:
Tool use: enabled.

Instructions:
- You are an artificial intelligence agent responsible for helping test realtime voice capabilities
- Please make sure to respond with a helpful voice via audio
- Be kind, helpful, and curteous
- Never ask the user questions, only provide the information that is needed. For example if I ask for weather, don't give me subjective opinions. Just state the facts.
- Use tools and functions you have available liberally, it is part of the training apparatus
- Never say "Hope that helps" or anything after the text. Please keep it short.
- If I ask about the weather, only tell me the temperature and the wind speed from the tool. Nothing else.


Personality:
- speak like a Canadian in a calming tone
`

Repro example

Speech from user: "What's the weather like in Seattle?"
Transcript: In Seattle, the temperature is 10.8 degrees Celsius with a wind speed of 4 kilometers per hour.
Actual audio played in browser from AI: In Seattle, the temperature is 10.8 degrees Celsius with a wind speed of 4
Observation: "kilometers per hour" is part of the wind_speed result returned from the function call, and it is clipped. Often an entire value is clipped off.

Example where it does not clip

Speech from user: "What's the weather like in Seattle?"
Transcript: In Seattle, the temperature is 10.8 degrees Celsius with a wind speed of 4 kilometers per hour. Pretty chilly, eh?
Actual audio played in browser from AI: In Seattle, the temperature is 10.8 degrees Celsius with a wind speed of 4 kilometers per hour. Pretty chilly, eh?
Observation: "Pretty chilly, eh?" is generated by the AI directly, so the function call value is no longer at the end of the speech and nothing is clipped.

I also filed a bug here: openai/openai-realtime-api-beta#36

I frequently see this happen too. Also see #33.

Function call output error

Tool has not been added

I get this error even after the function has successfully executed.

I'm using the client.addTool method in the body of the session.update method provided by the Realtime API client.
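For context, here is a simplified, hypothetical model of how a "has not been added" error could arise. It is not the library's code. It assumes the client keeps a local map of tool handlers and looks the handler up by name when the model requests a call, so a tool that is only declared in the session configuration, without a registered handler, would fail the lookup. `ToolRegistry` and its methods are illustrative names only.

```typescript
// Hypothetical handler type: receives parsed arguments, returns any result.
type ToolHandler = (args: Record<string, unknown>) => unknown;

// Simplified stand-in for a client-side registry of tool handlers.
class ToolRegistry {
  private handlers = new Map<string, ToolHandler>();

  // Registers a handler under the tool's name.
  addTool(name: string, handler: ToolHandler): void {
    this.handlers.set(name, handler);
  }

  // Looks up the handler by name; fails if none was registered locally.
  callTool(name: string, args: Record<string, unknown>): unknown {
    const handler = this.handlers.get(name);
    if (!handler) {
      throw new Error(`Tool "${name}" has not been added`);
    }
    return handler(args);
  }
}

const registry = new ToolRegistry();
registry.addTool("get_weather", (args) => ({ location: args.location, temperature: 10.8 }));
```

Under this assumption, the name the model uses in its function call must match the registered name exactly, and the handler must be registered before the call arrives.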
