Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss

Paper

Link: https://arxiv.org/pdf/2002.02562.pdf
Year: 2020

Summary

  • uses the attention mechanism from Transformer-XL and applies it to speech recognition
  • presents an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system (see the attention-mask sketch after this list)
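
A minimal sketch of how streaming is typically made possible with a Transformer encoder: masking self-attention so each frame attends only to a bounded left and right context. The helper below and its `left`/`right` parameters are hypothetical illustrations, not the paper's exact configuration.

```python
import torch

def streaming_attention_mask(T: int, left: int, right: int) -> torch.Tensor:
    """Boolean self-attention mask for a streamable encoder (sketch).

    Frame t may attend to frames in [t - left, t + right], so the
    model's lookahead (and hence latency) is bounded by `right` frames.
    """
    idx = torch.arange(T)
    rel = idx[None, :] - idx[:, None]       # rel[t, s] = s - t
    return (rel >= -left) & (rel <= right)  # True where attention is allowed
```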

Contributions and Distinctions from Previous Works

  • replaces the LSTM encoders of RNN-T with Transformer encoders
  • Unlike a typical attention-based sequence-to-sequence model, which attends over the entire input for every prediction in the output sequence, the RNN-T model produces a probability distribution over the label space at every time step. The output label space includes an additional null (blank) label to indicate that no output is emitted at that time step, similar to the Connectionist Temporal Classification (CTC) framework. Unlike CTC, however, this label distribution is also conditioned on the previous label history (see the joint-network sketch after this list)
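
A minimal sketch of that output distribution, assuming PyTorch and made-up module and dimension names (`RNNTJoint`, `enc_dim`, `pred_dim`, etc.): each (t, u) cell is a distribution over the vocabulary plus a blank label, conditioned on the audio up to frame t and on the label history encoded up to step u.

```python
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    """Sketch of an RNN-T joint network (hypothetical names/shapes).

    Combines audio-encoder frames and label-encoder states into a
    per-(frame, label-step) distribution over vocab + 1 blank symbol.
    """
    def __init__(self, enc_dim: int, pred_dim: int, joint_dim: int, vocab_size: int):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        # +1 output for the blank (null) label
        self.out = nn.Linear(joint_dim, vocab_size + 1)

    def forward(self, audio_enc, label_enc):
        # audio_enc: (B, T, enc_dim); label_enc: (B, U, pred_dim)
        f = self.enc_proj(audio_enc).unsqueeze(2)   # (B, T, 1, joint_dim)
        g = self.pred_proj(label_enc).unsqueeze(1)  # (B, 1, U, joint_dim)
        joint = torch.tanh(f + g)                   # (B, T, U, joint_dim)
        return self.out(joint).log_softmax(-1)      # log P(label | t, u)
```

Emitting the blank at (t, u) advances to frame t + 1 without producing a label, which is how the model handles frames with no output.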

Methods

[Figure: RNN-T model architecture with audio encoder, label encoder, and joint network]

  • The RNN-T architecture parameterizes P(z|x) with an audio encoder, a label encoder, and a joint network. The two encoders are neural networks that encode the input audio sequence and the target label sequence, respectively
  • In addition, to model sequential order, they use the relative positional encoding proposed in Transformer-XL. With relative positional encoding, position information affects only the attention scores, not the values being summed (see the attention sketch after this list)
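
A simplified sketch of relative positional attention in the spirit of Transformer-XL: a single head with one learned embedding per clipped relative offset (the full scheme also has global content/position bias terms). It illustrates the bullet above: position enters only the attention scores, while the summed values carry no positional information. All names here are hypothetical.

```python
import torch
import torch.nn as nn

class RelPosAttention(nn.Module):
    """Single-head self-attention with a relative-position score term (sketch)."""
    def __init__(self, dim: int, max_rel: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # one learned vector per relative offset in [-max_rel, max_rel]
        self.rel_emb = nn.Embedding(2 * max_rel + 1, dim)
        self.max_rel = max_rel
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # content-content term
        scores = torch.einsum("btd,bsd->bts", q, k)
        # content-position term: q_t . r_{s-t}, clipped to +/- max_rel
        offsets = torch.arange(T)[None, :] - torch.arange(T)[:, None]  # (T, T)
        offsets = offsets.clamp(-self.max_rel, self.max_rel) + self.max_rel
        r = self.rel_emb(offsets)                        # (T, T, dim)
        scores = scores + torch.einsum("btd,tsd->bts", q, r)
        attn = (scores * self.scale).softmax(-1)
        return torch.einsum("bts,bsd->btd", attn, v)     # values stay position-free
```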

Comments

Conformer: Convolution-augmented Transformer for Speech Recognition later reports results that surpass this model's performance
