kan-bayashi/ParallelWaveGAN

Low inference speed of TTS on GPU

dalvlv opened this issue · 2 comments

May I ask why the RTF of TTS is only 0.09 for a 12-second sentence? I use the FastSpeech2 + HiFiGAN model, and my GPU is an A2000 (compute capability 8.0). I expected at least a 50x speedup, since the FastSpeech 2 paper reports a 50x speedup over Transformer TTS and the HiFiGAN paper claims a 1000x speedup. Can anyone tell me what's wrong?
Thank you!

Based on my experiments, it should be a bit faster than that.
On an Nvidia T4, CFS2 + HiFiGAN V1 gave RTF = 0.008 (averaged over 250 utterances).
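(For reference, an averaged RTF like this can be computed along the following lines. This is only a minimal sketch, assuming the ESPnet2 Text2Speech interface used in the demo notebook, i.e. text2speech(text)["wav"] and text2speech.fs; the texts list is a placeholder, not part of the original benchmark script.)

import time

import torch

# Sketch: average RTF over a list of utterances.
# `text2speech` is an ESPnet2 Text2Speech instance and `texts` is a
# placeholder list of input sentences (both assumed here).
rtfs = []
with torch.no_grad():
    for text in texts:
        start = time.time()
        wav = text2speech(text)["wav"]
        rtfs.append((time.time() - start) / (len(wav) / text2speech.fs))
print(f"Averaged RTF = {sum(rtfs) / len(rtfs):.5f}")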
Could you paste the pseudocode you use to calculate RTF?

Hi @kan-bayashi ,
Thank you for answering my question!
I use the colab code from this repository. The code is below:

import time

import torch

# `text2speech` is the ESPnet2 Text2Speech instance from the demo
# notebook and `x` is the input text.
with torch.no_grad():
    start = time.time()
    out_t2s = text2speech(x)
    wav = out_t2s["wav"]
    rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:.5f}")
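(One caveat when timing on GPU: CUDA kernels launch asynchronously, so reading the clock right after the call can under-report the GPU's actual work. A variant of the snippet above that synchronizes before stopping the timer, as a sketch assuming a CUDA device is in use:)

import time

import torch

with torch.no_grad():
    start = time.time()
    wav = text2speech(x)["wav"]
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work to finish
    rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:.5f}")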

I timed three parts of the inference for a 24-second Chinese sentence:
part 1: preprocess: 387 ms
    text = self.preprocess_fn("<dummy>", dict(text=text))["text"]
part 2: CFS2 model: 701 ms
part 3: HiFiGAN: 24 ms
RTF: sum of parts / 24 s ≈ 0.046
(Yes, RTF improves for longer sentences, because FastSpeech generates the whole utterance in parallel.)
It seems that preprocessing and CFS2 cost most of the time.
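(A per-stage breakdown like the one above can be measured with a small timing helper. This is only a sketch: run_preprocess, run_acoustic_model, and run_vocoder are hypothetical stand-ins for the three stages and are not part of the ESPnet API.)

import time

import torch

def timed(label, fn, *args):
    """Run fn, sync the GPU, and report wall-clock time in milliseconds."""
    start = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # include pending GPU work in the timing
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return out

# Hypothetical wrappers around the three stages timed above.
with torch.no_grad():
    tokens = timed("part 1: preprocess", run_preprocess, text)
    mel = timed("part 2: CFS2", run_acoustic_model, tokens)
    wav = timed("part 3: HiFiGAN", run_vocoder, mel)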