erew123/alltalk_tts

Streaming mode not working on Firefox

Closed this issue · 4 comments

Streaming mode doesn't play any sound in Firefox, but it works in Chrome.

Hi @yamatazen

I've spent a good 2+ hours looking at this. I've tested Firefox on Windows and Linux and the same issue persists. Looking in the developer console (typically F12 in a web browser) you can see:

Media resource http://127.0.0.1:7851/api/tts-generate-streaming?text=This+is+a+test+of+streaming+audio&voice=arnold.wav&language=en&output_file=demo_output.wav&streaming=true could not be decoded, error: Error Code: NS_ERROR_DOM_MEDIA_METADATA_ERR (0x806e0006)

[Screenshot: Firefox developer console showing the decode error and the response headers for the streaming request]

I've hunted around the internet for NS_ERROR_DOM_MEDIA_METADATA_ERR (0x806e0006) and this appears to be a long-standing issue with Firefox, going back as far as 15 years on Mozilla's own bug/issue pages. For a few things like Ogg formats they provided fixes, while other formats have simply never been resolved (for whatever reason). You can find plenty of reports, even from this year, of people hitting this issue.

The typical response from developers who have looked into this is along the lines of: this is a Firefox-specific issue, and either Mozilla needs to fix it or you have to transcode all your media.

As best I can tell, Firefox either isn't good at handling different bit depths of audio, or is simply too strict. There are no settings in Firefox's backend that can be changed to resolve this (and I tried quite a few). The strange thing is that the WAV audio produced by the TTS scripts for streaming is exactly the same WAV format used when generating a WAV file without streaming, so bit depth, encoding etc. are all the same; Firefox just doesn't want to handle it.
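If anyone wants to verify that for themselves, here's a quick sketch (mine, not part of the repo) that prints the format of a non-streaming output file so it can be compared against what the streaming code writes. demo_output.wav is just the file name from the request above, so adjust the path to wherever your output actually lands.

    # Hedged sketch: check the format of a non-streaming output file so it can
    # be compared against the parameters the streaming code writes into its
    # header (1 channel, 16-bit samples, 24000 Hz). The path is an assumption.
    import wave

    with wave.open("demo_output.wav", "rb") as wf:
        print(wf.getnchannels(), wf.getsampwidth() * 8, wf.getframerate())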

So, I don't see any way I can resolve this:

  1. I cannot transcode, because the streaming TTS generation request goes directly into the TTS scripts/engine and is returned directly to the request source. Additionally, transcoding would double the performance hit and largely negate the performance gains of streaming anyway.
  2. The actual scripts/code that encode the WAV are not controlled/owned/managed by myself, so I can't tinker with them and change anything. For the few settings I do have access to, I have tested alternatives (see the code snippet below) and unfortunately there is no success to be had there. I have also confirmed the content headers are correctly generated and sent to Firefox (as per the developer console screenshot above); a quick way to double-check the streamed response from outside the browser is sketched after the snippet.
  3. Mozilla seems resistant to resolving these issues, based on years of people requesting bug fixes for various media types. Why this is the case I can't say, but that just appears to be the way it is with them.
            # Write an empty WAV container (1 channel, 16-bit, 24 kHz) and send
            # its header as the first chunk; the data length in that header is
            # zero because no frames have been written yet.
            file_chunks = []
            wav_buf = io.BytesIO()
            with wave.open(wav_buf, "wb") as vfout:
                vfout.setnchannels(1)
                vfout.setsampwidth(2)
                vfout.setframerate(24000)
                vfout.writeframes(b"")
            wav_buf.seek(0)
            yield wav_buf.read()

            # Follow the header with raw 16-bit PCM converted from each tensor
            # chunk the model produces.
            for i, chunk in enumerate(output):
                file_chunks.append(chunk)
                if isinstance(chunk, list):
                    chunk = torch.cat(chunk, dim=0)
                chunk = chunk.clone().detach().cpu().numpy()
                chunk = chunk[None, : int(chunk.shape[0])]
                chunk = np.clip(chunk, -1, 1)
                chunk = (chunk * 32767).astype(np.int16)
                yield chunk.tobytes()
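For anyone who wants to reproduce the check in point 2 outside a browser, here's a rough sketch (not part of alltalk_tts) that hits the same streaming endpoint shown in the error above using the requests library and prints the response headers plus the first bytes of the stream:

    # Rough sketch, assuming the server is running locally on the default port
    # shown in the error above. Prints the Content-Type header and checks that
    # the stream starts with the RIFF/WAVE header yielded by the code above.
    import requests

    params = {
        "text": "This is a test of streaming audio",
        "voice": "arnold.wav",
        "language": "en",
        "output_file": "demo_output.wav",
        "streaming": "true",
    }
    with requests.get(
        "http://127.0.0.1:7851/api/tts-generate-streaming",
        params=params,
        stream=True,
        timeout=60,
    ) as resp:
        print(resp.status_code, resp.headers.get("Content-Type"))
        first = next(resp.iter_content(chunk_size=44))
        print(first[:4], first[8:12])  # expect b'RIFF' and b'WAVE' if the
                                       # first 44 bytes arrive in one chunk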

As such, unless someone else has any bright ideas, or Mozilla changes something, I will have to leave this as: Firefox doesn't support streaming, and using Chrome, Edge or basically any other browser is the current workaround.

Sorry and thanks.

So this is a browser issue. I see.

Sorry to butt in on this closed issue. I have been looking into making my own little server for Piper and used these code snippets as a reference. I have since learned that the way the code generates the audio for streaming creates a malformed WAV file.

According to Wikipedia, a WAV header contains a 4-byte field denoting the size of the sample data.

The code below creates the header and streams it to the browser before streaming the samples. However, this means that the sample-size field in the header is zero (see the check sketched after the snippet).

            file_chunks = []
            wav_buf = io.BytesIO()
            with wave.open(wav_buf, "wb") as vfout:
                vfout.setnchannels(1)
                vfout.setsampwidth(2)
                vfout.setframerate(24000)
                vfout.writeframes(b"")
            wav_buf.seek(0)
            yield wav_buf.read()
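To make that concrete, here's a quick check of my own (not from the repo) that builds the same empty-data header as the snippet above and reads the size fields back out of it:

    # Build the same empty-data header as the snippet above and read the two
    # size fields back out of it with the stdlib struct module.
    import io
    import struct
    import wave

    wav_buf = io.BytesIO()
    with wave.open(wav_buf, "wb") as vfout:
        vfout.setnchannels(1)
        vfout.setsampwidth(2)
        vfout.setframerate(24000)
        vfout.writeframes(b"")

    header = wav_buf.getvalue()                        # 44 bytes, no sample data
    riff_size = struct.unpack("<I", header[4:8])[0]    # RIFF chunk size
    data_size = struct.unpack("<I", header[40:44])[0]  # 'data' chunk size
    print(len(header), riff_size, data_size)           # 44 36 0 -> data length declared as zero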

WAV is not a suitable format for on-the-fly generation and streaming, since it requires knowing the sample size ahead of time, and that size is not known with the way the audio is generated and streamed here.

In my testing, the WAV generated via streaming plays in VLC and MPV, but those players have to guess the length of the audio since no sample size is provided. The malformed WAV can also crash Audacity.
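As a related sketch of the same point: a size-trusting reader such as Python's own wave module reports zero frames for a stream built this way, while lenient players simply keep reading until the connection closes.

    # Size-trusting readers see zero frames in a stream whose header was sent
    # before any samples existed; lenient players ignore the declared size and
    # read until the stream ends.
    import io
    import wave

    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:                 # same empty-data header as above
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(24000)
        w.writeframes(b"")
    stream = buf.getvalue() + b"\x00\x00" * 24000   # header + 1 second of "streamed" silence

    with wave.open(io.BytesIO(stream), "rb") as r:
        print(r.getnframes())                       # 0 -- only the declared (wrong) length is visible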

@TheBill2001 You are correct in the case of Piper. However, the code I have for Piper does NOT support streaming, and streaming is disabled elsewhere in the Piper model settings, which blocks use of that code.

[Screenshot: Piper model settings showing streaming disabled]

However, due to the way the model is called, there has to be some pseudo-code for streaming; otherwise, calling it as an async function creates errors elsewhere.

I would suggest referencing the Piper site for documentation/code on creating Piper as a streaming setup.

The code I have there is only correct for streaming the XTTS AI model and, as mentioned, uses pseudo-code to keep Python happy about calling the function as an async process (a purely illustrative sketch of that idea is below).
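Purely as an illustration of that last point, and definitely not the actual alltalk_tts code (the function name here is made up), a non-streaming engine can expose the same async-generator signature as the streaming engines so the shared call site doesn't break:

    # Illustrative only -- not the repo's code. An engine that cannot stream
    # still exposes an async generator so it can be called the same way as the
    # engines that do; streaming is blocked in its settings before this runs.
    from typing import AsyncGenerator

    async def generate_audio_stream(text: str) -> AsyncGenerator[bytes, None]:
        raise NotImplementedError("Streaming is not supported by this engine")
        yield b""  # unreachable, but makes this function an async generator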

Thanks