jinglescode/papers

Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

jinglescode opened this issue · 0 comments

Paper

Link: https://arxiv.org/pdf/1904.02619.pdf
Year: 2019

Summary

fully convolutional encoder and a simple decoder can give superior results to a strong RNN baseline while being an order of magnitude more efficient. Key to the success of the convolutional encoder is a time-depth separable block structure which allows the model to retain a large receptive field

image

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our
model improves WER on LibriSpeech while being an order of
magnitude more efficient than a strong RNN baseline. Key to
our approach is a time-depth separable convolution block which
dramatically reduces the number of parameters in the model
while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows
us to effectively integrate a language model. Coupled with a
convolutional language model, our time-depth separable convolution architecture improves by more than 22% relative WER
over the best previously reported sequence-to-sequence results
on the noisy LibriSpeech test set.