kan-bayashi/ParallelWaveGAN

Low inference speed of TTS on GPU

dalvlv opened this issue · 2 comments

May I ask why the RTF of TTS is only 0.09 for a 12-second sentence? I use the FastSpeech2 + HiFiGAN model, and my GPU is an A2000 (compute capability 8.0). I expected at least a 50x speedup, since the FastSpeech 2 paper reports a 50x speedup over Transformer TTS and the HiFiGAN paper claims a 1000x speedup. Can anyone tell me what's wrong?
Thank you!

Based on my experiments, it should be a bit faster than that.
On an Nvidia T4, CFS2 + HiFiGAN V1 gave RTF = 0.008 (averaged over 250 utterances).
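(For reference, an averaged RTF like this can be computed along the following lines. This is only a minimal sketch, assuming the ESPnet2 Text2Speech interface used in the demo notebook, i.e. text2speech(text)["wav"] and text2speech.fs; the texts list is a placeholder, not part of the original benchmark script.)

import time

import torch

# Sketch: average RTF over a list of utterances.
# `text2speech` is an ESPnet2 Text2Speech instance and `texts` is a
# placeholder list of input sentences (both assumed here).
rtfs = []
with torch.no_grad():
    for text in texts:
        start = time.time()
        wav = text2speech(text)["wav"]
        rtfs.append((time.time() - start) / (len(wav) / text2speech.fs))
print(f"Averaged RTF = {sum(rtfs) / len(rtfs):.5f}")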
Could you paste the pseudocode you use to calculate RTF?

Hi @kan-bayashi ,
Thank you for answering my question!
I use the colab code from this repository. The code is below:

import time

import torch

# `text2speech` is the ESPnet2 Text2Speech instance from the demo
# notebook and `x` is the input text.
with torch.no_grad():
    start = time.time()
    out_t2s = text2speech(x)
    wav = out_t2s["wav"]
    rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:.5f}")
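(One caveat when timing on GPU: CUDA kernels launch asynchronously, so reading the clock right after the call can under-report the GPU's actual work. A variant of the snippet above that synchronizes before stopping the timer, as a sketch assuming a CUDA device is in use:)

import time

import torch

with torch.no_grad():
    start = time.time()
    wav = text2speech(x)["wav"]
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work to finish
    rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:.5f}")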

I timed three parts of the inference for a 24-second Chinese sentence:
part 1: preprocess: 387 ms
    text = self.preprocess_fn("<dummy>", dict(text=text))["text"]
part 2: CFS2 model: 701 ms
part 3: HiFiGAN: 24 ms
RTF: sum of parts / 24 s ≈ 0.046
(Yes, RTF improves for longer sentences, because FastSpeech generates the whole utterance in parallel.)
It seems that preprocessing and CFS2 cost most of the time.
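(A per-stage breakdown like the one above can be measured with a small timing helper. This is only a sketch: run_preprocess, run_acoustic_model, and run_vocoder are hypothetical stand-ins for the three stages and are not part of the ESPnet API.)

import time

import torch

def timed(label, fn, *args):
    """Run fn, sync the GPU, and report wall-clock time in milliseconds."""
    start = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # include pending GPU work in the timing
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return out

# Hypothetical wrappers around the three stages timed above.
with torch.no_grad():
    tokens = timed("part 1: preprocess", run_preprocess, text)
    mel = timed("part 2: CFS2", run_acoustic_model, tokens)
    wav = timed("part 3: HiFiGAN", run_vocoder, mel)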