Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss

Paper

Link: https://arxiv.org/pdf/2002.02562.pdf
Year: 2020

Summary

  • uses the attention mechanism from Transformer-XL and applies it to speech recognition
  • presents an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system (see the attention-mask sketch after this list)
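
A minimal sketch of how streaming is typically made possible with a Transformer encoder: masking self-attention so each frame attends only to a bounded left and right context. The helper below and its `left`/`right` parameters are hypothetical illustrations, not the paper's exact configuration.

```python
import torch

def streaming_attention_mask(T: int, left: int, right: int) -> torch.Tensor:
    """Boolean self-attention mask for a streamable encoder (sketch).

    Frame t may attend to frames in [t - left, t + right], so the
    model's lookahead (and hence latency) is bounded by `right` frames.
    """
    idx = torch.arange(T)
    rel = idx[None, :] - idx[:, None]       # rel[t, s] = s - t
    return (rel >= -left) & (rel <= right)  # True where attention is allowed
```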

Contributions and Distinctions from Previous Works

  • replaces the LSTM encoders of RNN-T with Transformer encoders
  • Unlike a typical attention-based sequence-to-sequence model, which attends over the entire input for every prediction in the output sequence, the RNN-T model produces a probability distribution over the label space at every time step. The output label space includes an additional null (blank) label to indicate that no output is emitted at that time step, similar to the Connectionist Temporal Classification (CTC) framework. Unlike CTC, however, this label distribution is also conditioned on the previous label history (see the joint-network sketch after this list)
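
A minimal sketch of that output distribution, assuming PyTorch and made-up module and dimension names (`RNNTJoint`, `enc_dim`, `pred_dim`, etc.): each (t, u) cell is a distribution over the vocabulary plus a blank label, conditioned on the audio up to frame t and on the label history encoded up to step u.

```python
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    """Sketch of an RNN-T joint network (hypothetical names/shapes).

    Combines audio-encoder frames and label-encoder states into a
    per-(frame, label-step) distribution over vocab + 1 blank symbol.
    """
    def __init__(self, enc_dim: int, pred_dim: int, joint_dim: int, vocab_size: int):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        # +1 output for the blank (null) label
        self.out = nn.Linear(joint_dim, vocab_size + 1)

    def forward(self, audio_enc, label_enc):
        # audio_enc: (B, T, enc_dim); label_enc: (B, U, pred_dim)
        f = self.enc_proj(audio_enc).unsqueeze(2)   # (B, T, 1, joint_dim)
        g = self.pred_proj(label_enc).unsqueeze(1)  # (B, 1, U, joint_dim)
        joint = torch.tanh(f + g)                   # (B, T, U, joint_dim)
        return self.out(joint).log_softmax(-1)      # log P(label | t, u)
```

Emitting the blank at (t, u) advances to frame t + 1 without producing a label, which is how the model handles frames with no output.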

Methods

[Figure: RNN-T model architecture with audio encoder, label encoder, and joint network]

  • The RNN-T architecture parameterizes P(z|x) with an audio encoder, a label encoder, and a joint network. The two encoders are neural networks that encode the input audio sequence and the target label sequence, respectively
  • In addition, to model sequential order, they use the relative positional encoding proposed in Transformer-XL. With relative positional encoding, position information affects only the attention scores, not the values being summed (see the attention sketch after this list)
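
A simplified sketch of relative positional attention in the spirit of Transformer-XL: a single head with one learned embedding per clipped relative offset (the full scheme also has global content/position bias terms). It illustrates the bullet above: position enters only the attention scores, while the summed values carry no positional information. All names here are hypothetical.

```python
import torch
import torch.nn as nn

class RelPosAttention(nn.Module):
    """Single-head self-attention with a relative-position score term (sketch)."""
    def __init__(self, dim: int, max_rel: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # one learned vector per relative offset in [-max_rel, max_rel]
        self.rel_emb = nn.Embedding(2 * max_rel + 1, dim)
        self.max_rel = max_rel
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # content-content term
        scores = torch.einsum("btd,bsd->bts", q, k)
        # content-position term: q_t . r_{s-t}, clipped to +/- max_rel
        offsets = torch.arange(T)[None, :] - torch.arange(T)[:, None]  # (T, T)
        offsets = offsets.clamp(-self.max_rel, self.max_rel) + self.max_rel
        r = self.rel_emb(offsets)                        # (T, T, dim)
        scores = scores + torch.einsum("btd,tsd->bts", q, r)
        attn = (scores * self.scale).softmax(-1)
        return torch.einsum("bts,bsd->btd", attn, v)     # values stay position-free
```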

Comments

Conformer: Convolution-augmented Transformer for Speech Recognition later reports results that surpass this model's performance
