larynx.text_to_speech() function doesn't work

Question


asters-a opened this issue 2 years ago · 1 comment

Hi there,
I can't get the larynx.text_to_speech Python function to work. onnxruntime logs the warnings below, and the audio that plays is just noise:

2023-04-03 14:51:03.305121301 [W:onnxruntime:, execution_frame.cc:835 VerifyOutputSizes] Expected shape from model of {1,80,244} does not match actual shape of {1,80,234} for output 3453
2023-04-03 14:51:03.325693586 [W:onnxruntime:, execution_frame.cc:835 VerifyOutputSizes] Expected shape from model of {-1,-1,244} does not match actual shape of {1,80,234} for output output
2023-04-03 14:51:03.421273662 [W:onnxruntime:, execution_frame.cc:835 VerifyOutputSizes] Expected shape from model of {-1,1,12800} does not match actual shape of {1,1,59904} for output audio

I know I can call the Larynx server with curl, and that works properly, but I want to use Larynx without running the server. It did work for me with a previous version, when text_to_speech didn't require the model and vocoder parameters, but I can't get it working with the newer version.

Can anyone help? Here's my test code:

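import larynx
import pyaudio
# Assumption: the enum types are importable from larynx.constants.
from larynx.constants import TextToSpeechType, VocoderType

# Load the TTS model and the vocoder explicitly (required by the newer API).
larynx_model = larynx.load_tts_model(TextToSpeechType.GLOW_TTS, "en-us/southern_english_female-glow_tts")
larynx_vocoder = larynx.load_vocoder_model(VocoderType.HIFI_GAN, "hifi_gan/vctk_small")
audio_settings = larynx.AudioSettings()

tts_result = larynx.text_to_speech(
    text="Hello there",
    lang="en",
    tts_model=larynx_model,
    vocoder_model=larynx_vocoder,
    audio_settings=audio_settings
)

# Play each synthesized result through a fresh PyAudio output stream.
for result in tts_result:
    p = pyaudio.PyAudio()
    stream = p.open(
        format=p.get_format_from_width(audio_settings.sample_bytes),
        channels=audio_settings.channels,
        rate=audio_settings.sample_rate,
        output=True
    )
    # result[1] is the audio array; its raw bytes are written as-is.
    stream.write(result[1].tobytes())
    stream.stop_stream()
    stream.close()
    p.terminate()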

Answer 1 · 2023-05-18T19:01:26.000Z

Please take a look at Piper, the successor to Larynx: https://github.com/rhasspy/piper/
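
If you need to stay on Larynx for now, one thing that may be worth ruling out is a sample-format mismatch in the playback loop. This is not a confirmed diagnosis of the issue above. Writing an array's raw bytes to a PyAudio stream only sounds right when the array's dtype matches the stream's sample width, so a float array sent to a 16-bit stream plays as noise. The sketch below assumes float samples lie in [-1.0, 1.0]; `to_int16_pcm` is a hypothetical helper, not part of Larynx, and it is exercised on synthetic data rather than real Larynx output.

```python
import numpy as np


def to_int16_pcm(audio: np.ndarray) -> bytes:
    """Return 16-bit PCM bytes for an audio array.

    Assumption: float arrays hold samples in [-1.0, 1.0]. Integer arrays
    are treated as PCM already and are only cast to int16.
    """
    audio = np.asarray(audio)
    if np.issubdtype(audio.dtype, np.floating):
        # Clip first so out-of-range samples do not wrap around on the cast.
        audio = np.clip(audio, -1.0, 1.0) * 32767.0
    return audio.astype(np.int16).tobytes()


# Synthetic samples standing in for one synthesized audio array.
samples = np.array([0.0, 0.5, -1.0, 2.0], dtype=np.float32)
print("input dtype:", samples.dtype)

pcm = to_int16_pcm(samples)
decoded = np.frombuffer(pcm, dtype=np.int16)
print("bytes written:", len(pcm), "samples:", decoded.tolist())
```

In the loop from the question, this would mean printing `result[1].dtype` once and, if it is a float type, passing `to_int16_pcm(result[1])` to `stream.write` instead of `result[1].tobytes()`, assuming the stream was opened with a 2-byte sample width.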