jinglescode/papers

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

jinglescode opened this issue · 0 comments

Paper

Link: https://arxiv.org/pdf/2106.09660.pdf
Year: 2021

Summary

  • end-to-end text-to-speech model that synthesizes the waveform directly from the input sequence, without hand-designed intermediate features (e.g., spectrograms)

Methods

Three modules:

  • encoder: takes the input sequence and extracts hidden representations
  • resampling: matches the length of the encoder output to the output time scale
  • decoder: generates the waveform

encoder

  • 3 convolution layers with batch normalization and dropout
  • LSTM
  • zoneout regularization on the LSTM

resampling

  • Gaussian upsampling, as introduced in Non-Attentive Tacotron
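
A minimal NumPy sketch of Gaussian upsampling may help make the resampling step concrete. It assumes each token i has a duration d_i (in output frames) and a range sigma_i, places a Gaussian at the token's center c_i = sum of durations up to i minus d_i / 2, and builds each output frame as a normalized weighted sum of token representations. The function name, argument names, and the 1-based frame indexing are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gaussian_upsample(h, durations, sigmas):
    """Upsample token representations to frame rate (illustrative sketch).

    h:         (N, D) token representations from the encoder
    durations: (N,)   token durations, in output frames (assumed given)
    sigmas:    (N,)   per-token range of influence (assumed given, > 0)
    Returns (u, w): u is (T, D) upsampled output, w is (T, N) weights.
    """
    h = np.asarray(h, dtype=float)
    durations = np.asarray(durations, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)

    ends = np.cumsum(durations)            # end position of each token
    centers = ends - durations / 2.0       # c_i: center of each token
    T = int(round(ends[-1]))               # total number of output frames
    t = np.arange(1, T + 1, dtype=float)[:, None]   # frame positions, (T, 1)

    # Log of the Gaussian density N(t; c_i, sigma_i^2), up to a constant.
    log_pdf = -0.5 * ((t - centers) / sigmas) ** 2 - np.log(sigmas)
    log_pdf -= log_pdf.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(log_pdf)
    w /= w.sum(axis=1, keepdims=True)      # normalize over tokens per frame

    return w @ h, w

# Hypothetical toy input: 3 tokens with one-hot representations.
u, w = gaussian_upsample(np.eye(3), durations=[2, 3, 1], sigmas=[1.0, 1.0, 1.0])
print(u.shape)  # (6, 3)
```

Because the weights are soft, neighboring tokens blend near boundaries, and the whole operation stays differentiable with respect to the durations and ranges.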

decoder

  • consists of upsampling blocks and downsampling blocks

Results

  • the number of refinement steps controls a trade-off between fidelity and inference speed
  • experiments demonstrate that WaveGrad 2 is capable of generating high fidelity audio, comparable to strong baselines
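
The fidelity/speed trade-off can be illustrated with a generic iterative refinement loop. The sketch below assumes a DDPM-style sampler of the kind used by diffusion vocoders: start from Gaussian noise and repeatedly subtract predicted noise, with one network call per step, so fewer steps means faster but coarser synthesis. The noise schedule `betas`, the `predict_noise` callable, and the toy predictor are hypothetical placeholders; these notes do not specify the schedule or network interface actually used.

```python
import numpy as np

def refine(predict_noise, cond, betas, length, seed=0):
    """Run len(betas) refinement steps, starting from white noise (sketch).

    predict_noise(y, cond, noise_level) -> array shaped like y (hypothetical
    stand-in for the trained decoder); cond is the resampled encoder output.
    """
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    y = rng.standard_normal(length)                 # start from pure noise
    for n in range(len(betas) - 1, -1, -1):
        level = np.sqrt(alpha_bars[n])              # continuous noise level
        eps = predict_noise(y, cond, level)
        # Remove the predicted noise component for this step.
        y = (y - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:                                   # no noise added at the last step
            sigma = np.sqrt((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n])
            y = y + sigma * rng.standard_normal(length)
    return y

def toy_predictor(y, cond, level):
    """Hypothetical oracle for an all-zero (silent) target, for demonstration only."""
    return y / np.sqrt(1.0 - level ** 2)

# Fewer entries in the schedule means fewer network calls, hence faster synthesis.
fast = refine(toy_predictor, cond=None, betas=np.linspace(1e-4, 0.05, 6), length=16)
slow = refine(toy_predictor, cond=None, betas=np.linspace(1e-4, 0.05, 50), length=16)
```

Cost grows linearly with the number of steps, which is why the schedule length is the natural knob for trading quality against speed.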