toverainc/willow

How to send TTS reply sentence by sentence for longer text

Opened this issue · 1 comment

zmarty commented

I have some custom REST API server code that currently replies with a string, and via WIS TTS that string gets read out on the ESP32 box. This works well for short replies, but not when the LLM produces a long text. Would it be possible to send the response sentence by sentence as it is streamed from the LLM I use? Something similar to how the voice version of ChatGPT gradually streams back its reply.
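One building block for this would be splitting the LLM's token stream into complete sentences as it arrives. A minimal Python sketch (the server here is actually C# ASP.NET, so this is only illustrative, and `sentences_from_stream` is a hypothetical helper, not part of Willow or WIS):

```python
import re

def sentences_from_stream(token_stream):
    """Accumulate streamed text fragments and yield complete sentences.

    token_stream: any iterable of text fragments (e.g. LLM deltas).
    """
    buffer = ""
    # Split on sentence-ending punctuation followed by whitespace.
    boundary = re.compile(r"(?<=[.!?])\s+")
    for fragment in token_stream:
        buffer += fragment
        parts = boundary.split(buffer)
        # Everything before the last split point is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer

# Fragments as an LLM might stream them:
fragments = ["Hello wor", "ld. This is", " a long reply! Still going"]
print(list(sentences_from_stream(fragments)))
# → ['Hello world.', 'This is a long reply!', 'Still going']
```

Each yielded sentence could then be handed to TTS individually instead of waiting for the full reply.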

zmarty commented

Maybe a related observation: by default my C# ASP.NET API responds with Transfer-Encoding: chunked and no Content-Length header. In that case Willow just reads aloud "Success" instead of the body I send, presumably because it cannot determine the body length. If I change my code to force a Content-Length header, the body is read correctly.
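To see why a reader that depends on Content-Length can fail here: with chunked encoding the total length is never stated up front; the body arrives as hex-length-prefixed chunks terminated by a zero-length chunk. A minimal sketch of decoding that wire format (ignoring chunk extensions and trailers):

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked transfer-encoded body.

    Minimal sketch: no chunk extensions, no trailer headers.
    """
    body = b""
    pos = 0
    while True:
        # Each chunk starts with its size as a hex number, then CRLF.
        crlf = raw.index(b"\r\n", pos)
        size = int(raw[pos:crlf], 16)
        pos = crlf + 2
        if size == 0:
            break  # zero-length chunk marks the end of the body
        body += raw[pos:pos + size]
        pos += size + 2  # skip chunk data plus its trailing CRLF
    return body

wire = b"5\r\nHello\r\n6\r\n World\r\n0\r\n\r\n"
print(decode_chunked(wire))  # b'Hello World'
```

A client that only trusts Content-Length sees none and gives up, while a chunk-aware client can process each chunk as it arrives, which is exactly what incremental TTS would need.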

This got me thinking... could my request above be implemented using chunked transfer encoding?

Something like this proposal from GPT-4:

[HttpGet("stream")]
public async Task StreamResponse()
{
    // No need to set Transfer-Encoding manually: ASP.NET Core switches to
    // chunked encoding automatically when no Content-Length is set.
    foreach (var part in GetDataParts())
    {
        await Response.WriteAsync(part);
        await Response.Body.FlushAsync(); // push this chunk to the client now
        await Task.Delay(1000); // simulate per-sentence generation delay
    }
}

private IEnumerable<string> GetDataParts()
{
    yield return "Part 1 ";
    yield return "Part 2 ";
    yield return "Part 3 ";
}

The difficulty is that the ESP box would then need to keep contacting the inference server to fetch audio for each separate sentence as it comes in.
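One way to hide that per-sentence round trip is to overlap the work: synthesize sentence N+1 while sentence N is playing. A rough Python sketch of that pipeline, where `synthesize` and `play` are hypothetical stand-ins for a WIS TTS request and ESP audio playback (not real Willow APIs):

```python
import queue
import threading

def speak_streaming(sentence_iter, synthesize, play):
    """Pipeline: fetch audio for the next sentence while the current plays.

    synthesize(sentence) -> audio clip; play(clip) blocks until done.
    Both are caller-supplied; here they are placeholders for a TTS
    request and device playback.
    """
    audio_q = queue.Queue(maxsize=2)  # small buffer keeps latency low

    def producer():
        for sentence in sentence_iter:
            audio_q.put(synthesize(sentence))  # may block while queue is full
        audio_q.put(None)  # sentinel: no more audio coming

    threading.Thread(target=producer, daemon=True).start()

    # Consumer: play clips in order as they become available.
    while (clip := audio_q.get()) is not None:
        play(clip)

# Example with trivial stand-ins:
played = []
speak_streaming(iter(["First sentence.", "Second one."]), str.upper, played.append)
print(played)  # → ['FIRST SENTENCE.', 'SECOND ONE.']
```

The bounded queue means synthesis stays only a sentence or two ahead of playback, so a long LLM reply starts speaking after the first sentence instead of after the whole text.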