An effort to track benchmarking results over widely-used datasets for ASR (Automatic Speech Recognition). Note that ASR results are affected by a number of factors; it is thus important to report results along with those factors for fair comparison. In this way, we can measure progress in a more scientific way. Feel free to add and correct!
Terms | Explanations |
---|---|
ATT | Attention based Seq2Seq, including LAS (Listen Attend and Spell). |
NAS | Neural Architecture Search |
Unit | phone (namely monophone), biphone, triphone, wp (word-piece), character, chenone, BPE (byte-pair encoding) |
AM | Acoustic Model. Options: DNN-HMM / CTC / ATT / ATT+CTC / RNN-T / CTC-CRF. Note that we list some end-to-end (e2e) models (e.g., ATT, RNN-T) in this field, although these e2e models contain an implicit/internal LM through the decoder. |
AM size (M) | The number of parameters in millions in the Acoustic Model. For e2e models, this field reports the total number of parameters. |
LM | Language Model, explicitly used, word-level by default. "---" denotes not using shallow fusion with an explicit/external LM, particularly for ATT and RNN-T. |
LM size (M) | The number of parameters in millions in the neural Language Model. For n-gram LMs, this field denotes the total number of n-gram features. |
Data Aug. | Whether any form of data augmentation is used, such as SP (3-fold speed perturbation from Kaldi), SA (SpecAugment) |
Ext. Data | Whether any form of external data (either speech data or text corpora) is used |
WER | Word Error Rate |
CER | Character Error Rate |
L | #Layer, e.g., L24 denotes that the number of layers is 24 |
--- | not applied |
? | not known from the original paper |
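WER in the glossary above is the Levenshtein (edit) distance between the hypothesis and the reference word sequences, divided by the number of reference words. A minimal sketch (the function name is ours, not from any scoring toolkit):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed as word-level Levenshtein distance via dynamic programming."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all of r[:i]
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# 1 insertion over 3 reference words -> 1/3
print(word_error_rate("the cat sat", "the cat sat down"))
```

Running the same computation over characters instead of words gives CER, which is the standard metric for languages such as Mandarin (see the AISHELL-1 table below).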
This dataset contains about 80 hours of training data, consisting of read sentences from the Wall Street Journal, recorded under clean conditions. Available from the LDC as WSJ0 under the catalog number LDC93S6B.
The evaluation dataset contains the simpler eval92 subset and the harder dev93 subset.
Results are sorted by eval92 WER.
eval92 WER | dev93 WER | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Data | Paper |
---|---|---|---|---|---|---|---|---|---|
2.50 | 5.48 | mono-phone | CTC-CRF, deformable TDNN | 11.9 | 4-gram | 2.59 | SP | --- | Deformable TDNN |
2.7 | 5.3 | bi-phone | LF-MMI, TDNN-LSTM | ? | 4-gram | ? | SP | --- | LF-MMI TASLP2018 |
2.77 | 5.68 | mono-phone | CTC-CRF, TDNN NAS | 11.9 | 4-gram | 2.59 | SP | --- | NAS SLT2021 |
3.0 | 6.0 | bi-phone | EE-LF-MMI, TDNN-LSTM | ? | 4-gram | ? | SP | --- | EE-LF-MMI TASLP2018 |
3.2 | 5.7 | mono-phone | CTC-CRF, VGG-BLSTM | 16 | 4-gram | 2.59 | SP | --- | CAT IS2020 |
3.4 | 5.9 | sub-word | ATT, LSTM | 18 | RNN | 113 | --- | --- | ESPRESSO ASRU2019 |
3.79 | 6.23 | mono-phone | CTC-CRF, BLSTM | 13.5 | 4-gram | 2.59 | SP | --- | CTC-CRF ICASSP2019 |
4.9 | --- | mono-char | ATT+CTC, Transformers | ? | 4-gram | ? | SA | --- | phoneBPE-IS2020 |
5.0 | 8.1 | mono-char | CTC-CRF, VGG-BLSTM | 16 | 4-gram | 2.59 | SP | --- | CAT IS2020 |
This dataset contains about 260 hours of English telephone conversations between two strangers on a preassigned topic (LDC97S62). The testing is commonly conducted on eval2000 (a.k.a. hub5'00 evaluation, LDC2002S09 for speech data and LDC2002T43 for transcripts), which consists of two test subsets - Switchboard (SW) and CallHome (CH).
Results in square brackets denote the weighted average over SW and CH based on our calculation when not reported in the original paper.
Results are sorted by Sum WER.
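The bracketed Sum entries can be reproduced by pooling errors over the two subsets, i.e., weighting each subset's WER by its number of reference words. A sketch, where the word counts are hypothetical placeholders rather than the actual eval2000 counts:

```python
def pooled_wer(wer_sw: float, wer_ch: float,
               words_sw: int, words_ch: int) -> float:
    """Pooled WER over two subsets = total errors / total reference words."""
    total_errors = wer_sw * words_sw + wer_ch * words_ch
    return total_errors / (words_sw + words_ch)

# With equal (placeholder) word counts this reduces to the plain average:
print(pooled_wer(6.3, 13.3, 1000, 1000))  # (6.3 + 13.3) / 2
```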
SW | CH | Sum | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Data | Paper |
---|---|---|---|---|---|---|---|---|---|---|
6.3 | 13.3 | [9.8] | charBPE &phoneBPE | ATT+CTC, Transformers, L24 enc, L12 dec | ? | multi-level RNNLM | ? | SA | Fisher transcripts | phoneBPE-IS2020 |
6.4 | 13.4 | 9.9 | char | RNN-T, BLSTM-LSTM, ivector | 57 | LSTM | 84 | SP, SA, etc. | Fisher transcripts | Advancing RNN-T ICASSP2021 |
6.5 | 13.9 | 10.2 | phone | LF-MMI, TDNN-f | ? | Transformer | 25 | SP | Fisher transcripts | P-Rescoring ICASSP2021 |
6.8 | 14.1 | [10.5] | wp 1k | ATT | ? | LSTM | ? | SA | Fisher transcripts | SpecAug IS2019 |
6.9 | 14.5 | 10.7 | phone | CTC-CRF Conformer | 51.82 | Transformer | 25 | SP, SA | Fisher transcripts | AdvancingCTC-CRF |
7.2 | 14.8 | 11.1 | wp | CTC-CRF Conformer | 51.85 | Transformer | 25 | SP, SA | Fisher transcripts | AdvancingCTC-CRF |
7.9 | 15.7 | 11.8 | char | RNN-T BLSTM-LSTM | 57 | LSTM | 5 | SP, SA, etc. | --- | Advancing RNN-T ICASSP2021 |
8.3 | 17.1 | [12.7] | bi-phone | LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | Fisher transcripts | LF-MMI TASLP2018 |
8.6 | 17.0 | 12.8 | phone | LF-MMI, TDNN-f | ? | 4-gram | ? | SP | Fisher transcripts | P-Rescoring ICASSP2021 |
8.5 | 17.4 | [13.0] | bi-phone | EE-LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | Fisher transcripts | EE-LF-MMI TASLP2018 |
8.8 | 17.4 | 13.1 | mono-phone | CTC-CRF, VGG-BLSTM | 39.2 | LSTM | ? | SP | Fisher transcripts | CAT IS2020 |
9.0 | 18.1 | [13.6] | BPE | ATT/CTC | ? | Transformer | ? | SP | Fisher transcripts | ESPnet-Transformer ASRU2019 |
9.7 | 18.4 | 14.1 | mono-phone | CTC-CRF, chunk-based VGG-BLSTM | 39.2 | 4-gram | 1.74 | SP | Fisher transcripts | CAT IS2020 |
9.8 | 18.8 | 14.3 | mono-phone | CTC-CRF, VGG-BLSTM | 39.2 | 4-gram | 1.74 | SP | Fisher transcripts | CAT IS2020 |
10.3 | 19.3 | [14.8] | mono-phone | CTC-CRF, BLSTM | 13.5 | 4-gram | 1.74 | SP | Fisher transcripts | CTC-CRF ICASSP2019 |
The Fisher dataset contains about 1600 hours of English conversational telephone speech (First part: LDC2004S13 for speech data, LDC2004T19 for transcripts; second part: LDC2005S13 for speech data, LDC2005T19 for transcripts).
FisherSwbd includes both the Fisher and Switchboard datasets, totaling around 2000 hours. Evaluation is commonly conducted over the eval2000 and RT03 (LDC2007S10) datasets.
Results are sorted by Sum WER.
SW | CH | Sum | RT03 | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Data | Paper |
---|---|---|---|---|---|---|---|---|---|---|---|
7.5 | 14.3 | [10.9] | 10.7 | bi-phone | LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | --- | LF-MMI TASLP2018 |
7.6 | 14.5 | [11.1] | 11.0 | bi-phone | EE-LF-MMI, TDNN-LSTM | ? | LSTM | ? | SP | --- | EE-LF-MMI TASLP2018 |
7.3 | 15.0 | 11.2 | ? | mono-phone | CTC-CRF, VGG-BLSTM | 39.2 | LSTM | ? | SP | --- | CAT IS2020 |
8.3 | 15.5 | [11.9] | ? | char | ATT | ? | --- | ? | SP | --- | Tencent-IS2018 |
8.1 | 17.5 | [12.8] | ? | char | RNN-T | ? | 4-gram | ? | SP | --- | Baidu-ASRU2017 |
The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. The corpus is freely available for download, along with separately prepared language-model training data and pre-built language models.
There are four evaluation sets: dev-clean, dev-other, test-clean, and test-other. For the sake of display, the results are sorted by test-clean WER.
dev clean WER | dev other WER | test clean WER | test other WER | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Data | Paper |
---|---|---|---|---|---|---|---|---|---|---|---|
1.55 | 4.22 | 1.75 | 4.46 | triphone | LF-MMI multistream CNN | ? | self-attentive simple recurrent unit (SRU) | 139 | SA | --- | ASAPP-ASR |
1.7 | 3.6 | 1.8 | 3.6 | wp | CTC Conformer, wav2vec2.0 | 1017 | --- | --- | SA | --- | ConformerCTC |
--- | --- | 1.9 | 3.9 | wp | RNN-T Conformer | 119 | LSTM | ? | SA | Y | Conformer |
--- | --- | 1.9 | 4.1 | wp | RNN-T ContextNet (L) | 112.7 | LSTM | ? | SA | --- | ContextNet |
--- | --- | 2.1 | 4.2 | wp | CTC vggTransformer | 81 | Transformer | --- | SP, SA | Y | FB2020WPM |
--- | --- | 2.1 | 4.3 | wp | RNN-T Conformer | 119 | --- | --- | SA | Y | Conformer |
--- | --- | 2.26 | 4.85 | chenone | DNN-HMM Transformer | 90 | Transformer | ? | SP, SA | Y | TransHybrid |
1.9 | 4.5 | 2.3 | 5.0 | triphone | DNN-HMM BLSTM | ? | Transformer | ? | --- | Y | RWTH19ASR |
--- | --- | 2.31 | 4.79 | wp | CTC vggTransformer | 81 | 4-gram | ? | SP, SA | Y | FB2020WPM |
--- | --- | 2.5 | 5.8 | wp | ATT CNN-BLSTM | ? | RNN | ? | SA | Y | SpecAug IS2019 |
--- | --- | 2.51 | 5.95 | phone | CTC-CRF Conformer | 51.82 | Transformer | 338 | SA | Y | AdvancingCTC-CRF |
--- | --- | 2.54 | 6.33 | wp | CTC-CRF Conformer | 51.85 | Transformer | 338 | SA | Y | AdvancingCTC-CRF |
--- | --- | 2.6 | 5.59 | chenone | DNN-HMM Transformer | 90 | 4-gram | ? | SP, SA | Y | TransHybrid |
2.4 | 5.7 | 2.7 | 5.9 | wp | Conformer | 116 | --- | --- | SA | --- | ConformerCTC |
--- | --- | 2.8 | 6.8 | wp | ATT CNN-BLSTM | ? | --- | ? | SA | N | SpecAug IS2019 |
2.6 | 8.4 | 2.8 | 9.3 | wp | DNN-HMM LSTM | ? | Transformer | ? | --- | Y | RWTH19ASR |
3.87 | 10.28 | 4.09 | 10.65 | phone | CTC-CRF BLSTM | 13 | 4-gram | 1.45 | --- | --- | CTC-CRF ICASSP2019 |
--- | --- | 4.28 | --- | tri-phone | LF-MMI TDNN | ? | 4-gram | ? | SP | --- | LF-MMI Interspeech |
5.1 | 19.1 | 5.9 | 20.0 | biphone | LF-MMI TDNN-f | ? | 4-gram | ? | SP | Y | Pkwrap |
AISHELL-ASR0009-OS1 is a 178-hour open-source Mandarin speech corpus. It is a part of AISHELL-ASR0009, which contains utterances from 11 domains, including smart home, autonomous driving, and industrial production. The recordings were made in a quiet indoor environment using 3 different devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android-system mobile phone (16 kHz, 16-bit), and an iOS-system mobile phone (16 kHz, 16-bit). The high-fidelity audio was re-sampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. The corpus is divided into training, development, and test sets.
test CER | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Data | Paper |
---|---|---|---|---|---|---|---|---|
4.5 | char | ATT+CTC, Conformer | ? | LSTM | ? | SA+SP | --- | WNARS |
4.72 | char | ATT+CTC, Conformer | ? | attention rescoring | ? | SA+SP | --- | U2 |
5.2 | char | Conformer | ? | --- | ? | SA | --- | intermediate CTC loss |
6.34 | phone | CTC-CRF, VGG-BLSTM | 16 | 4-gram | 0.7 | SP | --- | CAT IS2020 |
The 4th CHiME challenge sets a target for distant-talking automatic speech recognition using a read speech corpus. Two types of data are employed: 'Real data' - speech data that is recorded in real noisy environments (on a bus, cafe, pedestrian area, and street junction) uttered by actual talkers. 'Simulated data' - noisy utterances that have been generated by artificially mixing clean speech data with noisy backgrounds.
There are four test sets. For the sake of display, the results are sorted by eval real WER.
dev simu WER | dev real WER | eval simu WER | eval real WER | Unit | AM | AM size (M) | LM | LM size (M) | Data Aug. | Ext. Data | Paper |
---|---|---|---|---|---|---|---|---|---|---|---|
1.15 | 1.50 | 1.45 | 1.99 | phone | wide-residual BLSTM | ? | LSTM | ? | --- | --- | Complex Spectral Mapping |
1.78 | 1.69 | 2.12 | 2.24 | phone | 6 DCNN ensemble | ? | LSTM | ? | --- | --- | USTC-iFlytek CHiME4 system |
2.10 | 1.90 | 2.66 | 2.74 | phone | LF-MMI, TDNN | ? | LSTM | ? | --- | --- | Kaldi-CHiME4 |