
An attempt at tracking the state of the art and recent results (bibliography) on speech recognition.


WER are we?

An attempt at tracking states of the art(s) and recent results on speech recognition. Feel free to correct! (Inspired by Are we there yet?)

To be updated with Interspeech 2015...



LibriSpeech

(Possibly trained on more data than LibriSpeech.)

| WER test-clean | WER test-other | Paper | Notes |
|---|---|---|---|
| 5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | HMM-DNN + pNorm* |
| 8.01% | 22.49% | same, Kaldi | HMM-(SAT)GMM |
| | 12.51% | Audio Augmentation for Speech Recognition | TDNN + pNorm + speed up/down speech |
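
The notes above mention "speed up/down speech" as an augmentation. As a rough illustration only, here is a minimal sketch of speed perturbation by linear-interpolation resampling in plain Python. The function name `speed_perturb` and the factors 0.9 and 1.1 are assumptions made for this example and are not taken from the table; real pipelines usually rely on a dedicated audio tool with proper filtering.

```python
import math


def speed_perturb(samples, factor):
    """Return `samples` resampled so that it plays `factor` times faster.

    Hypothetical helper for illustration. The sample rate is assumed to stay
    the same, so both duration and pitch change (factor > 1 gives shorter,
    higher-pitched audio). Linear interpolation is used for simplicity.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    n_out = int(round(n_in / factor))
    out = []
    for k in range(n_out):
        pos = k * factor           # fractional read position in the input
        i = int(math.floor(pos))
        if i >= n_in - 1:
            # Past the last full interval: hold the final sample.
            out.append(float(samples[-1]))
        else:
            frac = pos - i
            out.append((1 - frac) * samples[i] + frac * samples[i + 1])
    return out


# One second of a 440 Hz tone at 16 kHz, then slowed-down and sped-up copies.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
slow = speed_perturb(tone, 0.9)   # longer than the original
fast = speed_perturb(tone, 1.1)   # shorter than the original
```

Each perturbed copy keeps the original transcript, so the training set grows without new labelling.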


WSJ

(Possibly trained on more data than WSJ.)

| WER eval'92 | WER eval'93 | Paper | Notes |
|---|---|---|---|
| 3.63% | 5.66% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* |
| 5.6% | | Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal | CNN over RAW speech (wav) |

Switchboard Hub5'00

(Possibly trained on more data than SWB, but test set = full Hub5'00.)

| WER (SWB) | WER (full=SWB+CH) | Paper | Notes |
|---|---|---|---|
| 12.6% | 16% | Deep Speech: Scaling up end-to-end speech recognition | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB |
| 12.6% | 18.4% | Sequence-discriminative training of deep neural networks | HMM-DNN + sMBR |
| 12.9% | 19.3% | Audio Augmentation for Speech Recognition | TDNN + pNorm + speed up/down speech |
| 15% | 19.1% | Building DNN Acoustic Models for Large Vocabulary Speech Recognition | DNN + Dropout |
| 10.4% | | Joint Training of Convolutional and Non-Convolutional Neural Networks | CNN on MFSC/fbanks + 1 non-conv layer for fMLLR/I-Vectors concatenated in a DNN |
| 11.5% | | Deep Convolutional Neural Networks for LVCSR | CNN |



TIMIT

(So far, all results trained on TIMIT and tested on the standard test set.)

| PER | Paper | Notes |
|---|---|---|
| 16.7% | Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition | CNN in time and frequency + dropout, 17.6% w/o dropout |
| 17.6% | Attention-Based Models for Speech Recognition | Bi-RNN + Attention |
| 17.7% | Speech Recognition with Deep Recurrent Neural Networks | Bi-LSTM + skip connections w/ CTC |
| 23% | Deep Belief Networks for Phone Recognition | (first, modern) HMM-DBN |



Noise-robust ASR


BigCorp™®-specific dataset



  • WER: word error rate
  • PER: phone error rate
  • LM: language model
  • HMM: hidden Markov model
  • GMM: Gaussian mixture model
  • DNN: deep neural network
  • CNN: convolutional neural network
  • DBN: deep belief network (RBM-based DNN)
  • RNN: recurrent neural network
  • LSTM: long short-term memory
  • CTC: connectionist temporal classification
  • MMI: maximum mutual information
  • MPE: minimum phone error
  • sMBR: state-level minimum Bayes risk
  • SAT: speaker adaptive training
  • MLLR: maximum likelihood linear regression
  • LDA: (in this context) linear discriminant analysis
  • MFCC: Mel-frequency cepstral coefficients
  • FB/FBANKS/MFSC: Mel-frequency spectral coefficients
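
Since every table above is ranked by WER or PER, here is a minimal sketch of how these rates are commonly computed: the Levenshtein (edit) distance between the reference and the hypothesis token sequences, divided by the reference length. The function names and the example sentences are illustrative assumptions; published numbers typically come from standard scoring tools, which may also apply text normalization not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]                    # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref_text, hyp_text):
    """Word error rate on whitespace-tokenized text."""
    return error_rate(ref_text.split(), hyp_text.split())


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")

# PER is the same computation over phone sequences instead of words.
example_per = error_rate(["k", "ae", "t"], ["k", "ah", "t"])
```

Note that the rate can exceed 100% when the hypothesis contains many insertions, because the denominator is the reference length only.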