jinglescode/papers

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

jinglescode opened this issue 4 years ago · 0 comments

Paper

Link: https://arxiv.org/pdf/2106.09660.pdf
Year: 2021

Summary

End-to-end text-to-speech synthesis: the model takes a phoneme sequence and generates the waveform directly through iterative refinement, without hand-designed intermediate features (e.g., spectrograms).

Methods

Three modules:

encoder: takes the input sequence and extracts hidden representations
resampling: matches the length of the encoder output to the time scale of the output waveform
decoder: generates the waveform

encoder:

3 convolution layers, each with batch normalization and dropout
LSTM layer
zoneout regularization applied to the LSTM
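
As a rough illustration of the zoneout regularization listed above, here is a minimal sketch in plain Python. The function name `zoneout` and its parameters are made up for this note and are not taken from the paper's code. The idea shown: during training each hidden unit keeps its previous value with probability `rate`; at inference the two states are blended by their expected weights.

```python
import random


def zoneout(h_prev, h_new, rate, training=True, rng=None):
    """Zoneout on one recurrent step (illustrative helper, not the paper's code).

    h_prev: hidden state from the previous time step (list of floats)
    h_new:  candidate hidden state computed at the current step
    rate:   probability that a unit keeps its previous value
    """
    if training:
        rng = rng if rng is not None else random.Random()
        # Per unit: keep the old value with probability `rate`, else take the new one.
        return [p if rng.random() < rate else n for p, n in zip(h_prev, h_new)]
    # Inference: use the expectation of the random mask instead of sampling it.
    return [rate * p + (1.0 - rate) * n for p, n in zip(h_prev, h_new)]
```

Unlike dropout, which zeroes units, zoneout carries the previous state forward, so information and gradients still flow across time steps.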

resampling:

Gaussian upsampling, as introduced in Non-Attentive Tacotron
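
To make the Gaussian upsampling step concrete, here is a small sketch in plain Python. Each input token gets a Gaussian centered in the middle of its duration, and every output frame is a weighted average of the token vectors, with weights given by the normalized Gaussian densities. The function name, the use of frame midpoints (`t + 0.5`), and the toy values in the usage note are assumptions for illustration; in the actual model the durations and standard deviations are predicted (or taken from ground truth during training).

```python
import math


def gaussian_upsample(h, durations, sigmas):
    """Upsample N token vectors to sum(durations) frames (illustrative sketch).

    h:         list of N token vectors (lists of floats)
    durations: N positive durations, in output frames
    sigmas:    N positive standard deviations, in output frames
    Returns (frames, weights), where weights[t][i] is token i's weight at frame t.
    """
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)  # center of token i's span
    num_frames = int(round(end))
    dim = len(h[0])
    frames, weights = [], []
    for t in range(num_frames):
        pos = t + 0.5  # assumption: evaluate the Gaussians at frame midpoints
        # Log-density up to a shared constant; subtracting the max avoids underflow.
        logp = [-0.5 * ((pos - c) / s) ** 2 - math.log(s)
                for c, s in zip(centers, sigmas)]
        m = max(logp)
        dens = [math.exp(v - m) for v in logp]
        total = sum(dens)
        w = [x / total for x in dens]
        weights.append(w)
        frames.append([sum(wi * hi[k] for wi, hi in zip(w, h)) for k in range(dim)])
    return frames, weights
```

For example, with three one-hot token vectors, durations `[2, 2, 2]` and sigmas `[0.5, 0.5, 0.5]`, the result has 6 frames and each pair of frames is dominated by one token. Because the weights are smooth functions of the durations, the operation stays differentiable, unlike simply repeating each token vector.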

decoder:

consists of upsampling blocks and downsampling blocks

Results

Varying the number of refinement steps trades off fidelity against generation speed.
Experiments show that WaveGrad 2 generates high-fidelity audio, comparable to strong baselines.
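
The speed side of this trade-off can be seen in a sketch of the refinement loop. The code below assumes the standard diffusion-style sampler used by WaveGrad-type models: start from Gaussian noise and call the network once per step. The names `refine` and `make_oracle` and the noise schedule `betas` are hypothetical. `make_oracle` is a toy stand-in that cheats by knowing the target signal. It exists only to exercise the loop and is not a trained model.

```python
import math
import random


def refine(eps_model, cond, betas, length, rng):
    """Run len(betas) refinement steps, starting from pure noise (sketch)."""
    alphas = [1.0 - b for b in betas]
    abar, prod = [], 1.0
    for a in alphas:  # cumulative product of the alphas
        prod *= a
        abar.append(prod)
    y = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for n in range(len(betas) - 1, -1, -1):
        # One network call per step: predict the noise in y given the noise level.
        eps = eps_model(y, cond, math.sqrt(abar[n]))
        scale = betas[n] / math.sqrt(1.0 - abar[n])
        y = [(yi - scale * ei) / math.sqrt(alphas[n]) for yi, ei in zip(y, eps)]
        if n > 0:  # no noise is added after the final step
            sigma = math.sqrt((1.0 - abar[n - 1]) / (1.0 - abar[n]) * betas[n])
            y = [yi + sigma * rng.gauss(0.0, 1.0) for yi in y]
    return y


def make_oracle(target):
    """Toy noise predictor that knows the clean signal; counts its own calls."""
    calls = [0]

    def eps_model(y, cond, level):
        calls[0] += 1
        denom = math.sqrt(1.0 - level * level)
        return [(yi - level * ti) / denom for yi, ti in zip(y, target)]

    return eps_model, calls
```

The number of network calls equals the number of steps, so generation time grows linearly with the step count. Fewer steps are faster, but they need a coarser noise schedule, which is where fidelity can drop.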