
a curated list of speech datasets (105+ datasets, 70+ easy to download)

Apache License 2.0

Speech Datasets Collection


Contributions of more speech datasets are welcome! You can open an issue here to suggest new speech datasets; the list of datasets in the main branch is updated weekly (usually on weekends).

This is a curated list of open speech datasets for speech-related research (mainly for Automatic Speech Recognition).

Over 100 speech datasets are collected in this repository, and more than 70 datasets can be downloaded directly without further application or registration.

Notice:

  1. This repository does not list the license of each dataset. In general, these datasets may be used for research purposes. Please make sure a dataset's license is suitable before using it for commercial purposes.
  2. Some small-scale speech corpora are omitted for brevity.

1. Data Overview

| Dataset Acquisition | Sup/Unsup | All Languages (Hours) | Mandarin (Hours) | English (Hours) |
| --- | --- | --- | --- | --- |
| download directly | supervised | 190k+ | 2110+ | 34k+ |
| download directly | unsupervised | 515k+ | 1360+ | 68k+ |
| download directly | total | 705k+ | 3470+ | 102k+ |
| need application | supervised | 52k+ | 16740+ | 50k+ |
| need application | unsupervised | 60k+ | 12400+ | 57k+ |
| need application | total | 112k+ | 29140+ | 107k+ |
| total | supervised | 242k+ | 18850+ | 84k+ |
| total | unsupervised | 575k+ | 13760+ | 125k+ |
| total | total | 817k+ | 32610+ | 209k+ |
  • Mandarin here includes Mandarin-English code-switching (CS) corpora.
  • Sup means supervised speech corpus with high-quality transcription.
  • Unsup means unsupervised or weakly-supervised speech corpus.
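
The subtotal and total rows of the Data Overview table are sums of the four base rows. A minimal sketch that reproduces them is shown below; the names `overview`, `column_sums`, and `subtotal` are hypothetical and exist only for this illustration. It assumes "k" means thousands of hours and reads the trailing "+" as "at least", so it is dropped.

```python
# Hours from the four base rows of the Data Overview table, in the order
# (all languages, Mandarin, English). "k" is expanded to thousands and the
# trailing "+" (read here as "at least") is dropped.
overview = {
    ("download directly", "supervised"): (190_000, 2_110, 34_000),
    ("download directly", "unsupervised"): (515_000, 1_360, 68_000),
    ("need application", "supervised"): (52_000, 16_740, 50_000),
    ("need application", "unsupervised"): (60_000, 12_400, 57_000),
}


def column_sums(rows):
    """Add up each of the three hour columns across the given rows."""
    return tuple(sum(column) for column in zip(*rows))


def subtotal(acquisition=None, supervision=None):
    """Sum the rows matching the given filters; None matches every row."""
    rows = [
        hours
        for (acq, sup), hours in overview.items()
        if acquisition in (None, acq) and supervision in (None, sup)
    ]
    return column_sums(rows)


print(subtotal("download directly"))        # (705000, 3470, 102000)
print(subtotal(supervision="supervised"))   # (242000, 18850, 84000)
print(subtotal())                           # (817000, 32610, 209000)
```

The printed tuples match the "download directly / total", "total / supervised", and "total / total" rows of the table.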

2. List of ASR corpora

a. Datasets that can be downloaded directly

| Index | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Librispeech | English | Reading | [paper] | [dataset] | 960 |
| 2 | TED_LIUM v1 | English | Talks | [paper] | [dataset] | 118 |
| 3 | TED_LIUM v2 | English | Talks | [paper] | [dataset] | 207 |
| 4 | TED_LIUM v3 | English | Talks | [paper] | [dataset] | 452 |
| 5 | MLS | Multilingual | Reading | [paper] | [dataset] | 50k+ |
| 6 | thchs30 | Mandarin | Reading | [paper] | [dataset] | 35 |
| 7 | ST-CMDS | Mandarin | Commands | - | [dataset] | 100 |
| 8 | aishell | Mandarin | Recording | [paper] | [dataset] | 178 |
| 9 | aishell-3 | Mandarin | Recording | [paper] | [dataset] | 85 |
| 10 | aishell-4 | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 11 | aishell-eval | Mandarin | Misc | - | [dataset] | 80+ |
| 12 | Primewords | Mandarin | Recording | - | [dataset] | 100 |
| 13 | aidatatang_200zh | Mandarin | Recording | - | [dataset] | 200 |
| 14 | MagicData | Mandarin | Recording | - | [dataset] | 755 |
| 15 | MagicData-RAMC | Mandarin | Conversational | [paper] | [dataset] | 180 |
| 16 | Heavy Accent Corpus | Mandarin | Conversational | - | [dataset] | 58+ |
| 17 | AliMeeting | Mandarin | Meeting | [paper] | [dataset] | 120 |
| 18 | CN-Celeb | Mandarin | Misc | [paper] | [dataset] | unsup(274) |
| 19 | CN-Celeb2 | Mandarin | Misc | [paper] | [dataset] | unsup(1090) |
| 20 | The People's Speech | English | Misc | [paper] | [dataset] | 30000 |
| 21 | Multilingual TEDx | Multilingual | Talks | [paper] | [dataset] | 760+ |
| 22 | VoxPopuli | Multilingual | Misc | [paper] | [dataset] | sup(1.8k)+unsup(400k)=400k+ |
| 23 | Libri-Light | English | Reading | [paper] | [dataset] | unsup(60k) |
| 24 | Common Voice (Multilingual) | Multilingual | Recording | [paper] | [dataset] | v9.0: sup(15k)+unsup(5k)=20k |
| 25 | Common Voice (English) | English | Recording | [paper] | [dataset] | v9.0: sup(2200)+unsup(700)=2900+ |
| 26 | JTubeSpeech | Japanese | Misc | [paper] | [dataset] | 1300 |
| 27 | ai4bharat NPTEL2020 | English (Indian) | Lectures | - | [dataset] | weaksup(15.7k) |
| 28 | open_stt | Russian | Misc | - | [dataset] | 20k+ |
| 29 | ASCEND | Mandarin-English CS | Conversational | [paper] | [dataset] | 10+ |
| 30 | Crowd-Sourced Speech | Multilingual | Recording | [paper] | [dataset] | 1200+ |
| 31 | Spoken Wikipedia | Multilingual | Recording | [paper] | [dataset] | 1000+ |
| 32 | MuST-C | Multilingual | Talks | [paper] | [dataset] | 6000+ |
| 33 | M-AILABS | Multilingual | Reading | - | [dataset] | 1000 |
| 34 | CMU Wilderness | Multilingual | Misc | [paper] | [dataset] | unsup(14k) |
| 35 | Gram_Vaani | Hindi | Recording | [paper] [code] | [dataset] | unsup(1000)+sup(100) |
| 36 | VoxLingua107 | Multilingual | Misc | [paper] | [dataset] | unsup(6600+) |
| 37 | Kazakh Corpus | Kazakh | Recording | [paper] [code] | [dataset] | 335 |
| 38 | Voxforge | English | Recording | - | [dataset] | 130 |
| 39 | Tatoeba | English | Recording | - | [dataset] | 200 |
| 40 | IndicWav2Vec | Multilingual | Misc | [paper] | [dataset] | unsup(17k+) |
| 41 | VoxCeleb | English | Misc | [paper] | [dataset] | unsup(352) |
| 42 | VoxCeleb2 | English | Misc | [paper] | [dataset] | unsup(2442) |
| 43 | RuLibrispeech | Russian | Reading | - | [dataset] | 98 |
| 44 | MediaSpeech | Multilingual | Misc | [paper] | [dataset] | 40 |
| 45 | MUCS 2021 task1 | Multilingual | Misc | - | [dataset] | 300 |
| 46 | MUCS 2021 task2 | Multilingual | Misc | - | [dataset] | 150 |
| 47 | nicolingua-west-african | Multilingual | Misc | [paper] | [dataset] | 140+ |
| 48 | Samromur 21.05 | Icelandic | Misc | [code] | [dataset] [dataset] [dataset] | 145 |
| 49 | Puebla-Nahuatl | Puebla-Nahuatl | Misc | [paper] | [dataset] | 150+ |
| 50 | Golos | Russian | Misc | [paper] | [dataset] | 1240 |
| 51 | ParlaSpeech-HR | Croatian | Parliament | [paper] | [dataset] | 1816 |
| 52 | Lyon Corpus | French | Recording | [paper] | [dataset] | 185 |
| 53 | Providence Corpus | English | Recording | [paper] | [dataset] | 364 |
| 54 | CLARIN Spoken Corpora | Czech | Recording | - | [dataset] | 1120+ |
| 55 | Czech Parliament Plenary | Czech | Recording | - | [dataset] | 444 |
| 56 | (Youtube) Regional American Corpus | English (Accented) | Misc | [paper] | [dataset] | 29k+ |
| 57 | NISP Dataset | Multilingual | Recording | [paper] | [dataset] | 56+ |
| 58 | Regional African American | English (Accented) | Recording | [paper] | [dataset] | 130+ |
| 59 | Indonesian Unsup | Indonesian | Misc | - | [dataset] | unsup(3000+) |
| 60 | Librivox-Spanish | Spanish | Recording | - | [dataset] | 120 |
| 61 | AVSpeech | English | Audio-Visual | [paper] | [dataset] | unsup(4700) |
| 62 | CMLR | Mandarin | Audio-Visual | [paper] | [dataset] | 100+ |
| 63 | Speech Accent Archive | English | Accented | [paper] | [dataset] | TBC |
| 64 | BibleTTS | Multilingual | TTS | [paper] | [dataset] | 86 |
| 65 | NST-Norwegian | Norwegian | Recording | - | [dataset] | 540 |
| 66 | NST-Danish | Danish | Recording | - | [dataset] | 500+ |
| 67 | NST-Swedish | Swedish | Recording | - | [dataset] | 300+ |
| 68 | NPSC | Norwegian | Parliament | [paper] | [dataset] | 140 |
| 69 | CI-AVSR | Cantonese | Audio-Visual | [paper] | [dataset] | 8+ |
| 70 | Aalto Finnish Parliament | Finnish | Parliament | [paper] | [dataset] | 3100+ |
| 71 | UserLibri | English | Reading | [paper] | [dataset] | - |
| 72 | Ukrainian Speech | Ukrainian | Misc | - | [dataset] | 1300+ |

b. Datasets that can be downloaded after application

| Index | Name | Language | Type/Domain | Paper Link | Data Link | Size (Hours) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Fisher | English | Conversational | [paper] | [dataset] | 2000 |
| 2 | WenetSpeech | Mandarin | Misc | [paper] | [dataset] | sup(10k)+weaksup(2.4k)+unsup(10k)=22.4k |
| 3 | aishell-2 | Mandarin | Recording | [paper] | [dataset] | 1000 |
| 4 | aidatatang_1505zh | Mandarin | Recording | - | [dataset] | 1505 |
| 5 | SLT 2021 CSRC | Mandarin | Misc | [paper] | [dataset] | 400 |
| 6 | GigaSpeech | English | Misc | [paper] | [dataset] | sup(10k)+unsup(23k)=33k |
| 7 | SPGISpeech | English | Misc | [paper] | [dataset] | 5000 |
| 8 | AESRC 2020 | English (Accented) | Misc | [paper] | [dataset] | 160 |
| 9 | LaboroTVSpeech | Japanese | Misc | [paper] | [dataset] | 2000+ |
| 10 | TAL_CSASR | Mandarin-English CS | Lectures | - | [dataset] | 587 |
| 11 | ASRU 2019 ASR | Mandarin-English CS | Reading | - | [dataset] | 700+ |
| 12 | SEAME | Mandarin-English CS | Recording | [paper] | [dataset] | 196 |
| 13 | Fearless Steps | English | Misc | - | [dataset] | unsup(19k) |
| 14 | FTSpeech | Danish | Meeting | [paper] | [dataset] | 1800+ |
| 15 | KeSpeech | Mandarin | Recording | [paper] | [dataset] | 1542 |
| 16 | KsponSpeech | Korean | Conversational | [paper] | [dataset] | 969 |
| 17 | RTVE database | Spanish | TV | [paper] | [dataset] | 800+ |
| 18 | DiDiSpeech | Mandarin | Recording | [paper] | [dataset] | 800 |
| 19 | Babel | Multilingual | Telephone | [paper] | [dataset] | 1000+ |
| 20 | National Speech Corpus | English (Singapore) | Misc | [paper] | [dataset] | 3000+ |
| 21 | MyST Children's Speech | English | Recording | - | [dataset] | 393 |
| 22 | L2-ARCTIC | L2 English | Recording | [paper] | [dataset] | 20+ |
| 23 | JSpeech | Multilingual | Recording | [paper] | [dataset] | 1332+ |
| 24 | LRS2-BBC | English | Audio-Visual | [paper] | [dataset] | 220+ |
| 25 | LRS3-TED | English | Audio-Visual | [paper] | [dataset] | 470+ |
| 26 | LRS3-Lang | Multilingual | Audio-Visual | - | [dataset] | 1300+ |
| 27 | QASR | Arabic | Dialects | [paper] | [dataset] | 2000+ |
| 28 | ADI (MGB-5) | Arabic | Dialects | [paper] | [dataset] | unsup(3000+) |
| 29 | MGB-2 | Arabic | TV | [paper] | [dataset] | 1200+ |
| 30 | 3MASSIV | Multilingual | Audio-Visual | [paper] | [dataset] | sup(310)+unsup(600) |
| 31 | MDCC | Cantonese | Misc | [paper] | [dataset] | 73+ |
| 32 | Lahjoita Puhetta | Finnish | Misc | [paper] | [dataset] | sup(1600)+unsup(2000) |
| 33 | SDS-200 | Swiss German | Dialects | [paper] | [dataset] | 200 |
| 34 | Modality Corpus | Misc | Audio-Visual | [paper] | [dataset] | 30+ |
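
The Size (Hours) column mixes plain numbers with annotated entries such as sup(10k)+unsup(23k)=33k. As a rough illustration of how those entries can be read, the sketch below splits one entry into hours per label. The function name `parse_size` and the "unlabelled" key are hypothetical, chosen for this example only; the repository does not ship such a helper, and the parsing rules are inferred from the entries listed above.

```python
import re

# One size token: an optional label ("sup", "unsup", "weaksup") followed by
# an opening parenthesis, then a number with an optional "k" (thousands).
_TOKEN = re.compile(r"(?:(weaksup|unsup|sup)\s*\(\s*)?(\d+(?:\.\d+)?)(k?)")


def parse_size(text):
    """Return hours per label for one Size (Hours) entry (hypothetical helper)."""
    # Drop a version prefix such as "v9.0:" so its digits are not counted.
    text = re.sub(r"^v[\d.]+:\s*", "", text.strip())
    # Keep only the left-hand side of "a+b=total" to avoid double counting.
    text = text.split("=")[0]
    hours = {}
    for label, number, kilo in _TOKEN.findall(text):
        value = float(number) * (1000 if kilo else 1)
        key = label or "unlabelled"  # plain numbers carry no sup/unsup label
        hours[key] = hours.get(key, 0.0) + value
    return hours


print(parse_size("sup(10k)+unsup(23k)=33k"))  # {'sup': 10000.0, 'unsup': 23000.0}
print(parse_size("960"))                      # {'unlabelled': 960.0}
print(parse_size("TBC"))                      # {}
```

Entries without a number, such as "TBC" or "-", yield an empty result, and a trailing "+" is ignored, so the returned values are best treated as lower bounds.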

3. References