The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of paired speech audio and transcripts from 41 speakers.
This corpus was generated with Pansori, a new system for corpus data ingestion and processing. For further information on the Pansori ASR corpus generation system, please refer to the Pansori code repository and the accompanying paper.
Extra care was taken to maintain the quality of the generated corpus:
- Only TEDx talks hand-transcribed by community translators were included.
- Corpus fragments were segmented at subtitle boundaries.
- Segmentation was fine-tuned by manual (tool-assisted) speech-text alignment.
- Final validation was performed with a state-of-the-art speech recognizer (Google Cloud Speech-To-Text).
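The source does not say how the recognizer output was used in the final validation step. One plausible approach, shown here only as a minimal sketch, is to compute the character error rate (CER) between each fragment's transcript and the recognizer's hypothesis, and to flag fragments whose CER exceeds a threshold. The function names, the decision to ignore whitespace, and the 0.2 threshold are all assumptions made for illustration; the call to the recognizer itself is omitted.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, ignoring whitespace."""
    # Assumption: whitespace is stripped because Korean word spacing
    # often differs between a human transcript and recognizer output.
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def passes_validation(reference: str, hypothesis: str,
                      max_cer: float = 0.2) -> bool:
    """Hypothetical acceptance rule: keep the fragment if CER <= max_cer."""
    return character_error_rate(reference, hypothesis) <= max_cer
```

Fragments that fail such a check would be candidates for manual review rather than automatic removal, since recognizer errors and transcript errors cannot be told apart from the score alone.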
The speech audio in the corpus is stored as 16-bit FLAC files with a sampling rate of 16 kHz. Details of the included talks are summarized in the following table:
| Title | Speaker | Gender | Location | Year | Fragments | Duration (m:ss) |
| --- | --- | --- | --- | --- | --- | --- |
| Appropriate technology | 이성범 | M | Seoul | 2010 | 87 | 5:58 |
| Making a village worth living in | 김혜정 | F | Busan | 2012 | 191 | 9:14 |
| The true owner of land | 남기업 | M | Busan | 2012 | 155 | 6:43 |
| Starting from where I am | 황두진 | M | Seoul | 2010 | 117 | 6:41 |
| Telling the new story in the old form | 이자람 | F | Seoul | 2010 | 92 | 7:50 |
| Dreaming a way to future aerial vehicle from unmanned aircraft | 구삼옥 | M | Daedeok | 2011 | 121 | 7:34 |
| Misconception about evaluations | 유정식 | M | Busan | 2012 | 158 | 6:43 |
| Be an artist, right now! | 김영하 | M | Seoul | 2013 | 131 | 5:47 |
| Communication is recovery | 박임순 | F | Busan | 2012 | 161 | 6:24 |
| Jeju Olleh | 서명숙 | F | Seoul | 2010 | 135 | 9:16 |
| DIY OOOSSSZZZ band | 유상준 | M | Seoul | 2010 | 44 | 2:22 |
| Dynamic biology | 이선희 | F | Daedeok | 2011 | 68 | 4:44 |
| Active immersion in thinking | 황농문 | M | Daejeon | 2012 | 84 | 5:01 |
| Becoming a good-earthling | 이현정 | F | Busan | 2011 | 95 | 3:53 |
| More humane medical experience | 김승범, 정혜진 | M, F | Seoul | 2010 | 80 | 4:36 |
| Finding new energy to overcome resource limits | 이경수 | M | Daejeon | 2010 | 53 | 4:43 |
| Which do you love, pictures or camera? | 박희진 | M | Busan | 2014 | 38 | 2:42 |
| Every citizen is a journalist | 오연호 | M | Seoul | 2010 | 61 | 4:10 |
| Take time to imagine the world to rights | 윤한결 | M | Busan | 2013 | 126 | 5:01 |
| With feeling the aesthetics of slowness | 이상은 | F | Daejeon | 2011 | 29 | 3:45 |
| Beating disabilities to pioneer grassroots journalism | 조주현 | M | Daejeon | 2010 | 37 | 3:56 |
| Statistics 3.0 | 이인실 | F | Busan | 2011 | 94 | 3:42 |
| Why Analytical Science? | 정광화 | F | Daedeok | 2011 | 58 | 3:56 |
| Redefinition of soil and its possibilities | 신근식 | M | Busan | 2011 | 76 | 3:51 |
| Predict disease with face | 김종열 | M | Daedeok | 2011 | 72 | 4:08 |
| Sustainable DoReMi | 고건혁 | M | Seoul | 2010 | 78 | 3:10 |
| ITER, towards the dream of a fusion energy era | 정기정 | M | Daedeok | 2010 | 45 | 3:35 |
| Winning the world with the 'DID' mindset | 송수용 | M | Daejeon | 2010 | 66 | 3:19 |
| Social venture is blue ocean | 김정현 | M | Busan | 2011 | 60 | 2:56 |
| No prerequisite learning, no worry | 신현승 | M | Busan | 2012 | 49 | 2:44 |
| Passion and challenge | 신창연 | M | Busan | 2011 | 88 | 2:46 |
| Are science and liberal arts equal? | 김상욱 | M | Busan | 2013 | 67 | 2:36 |
| Perspective, music and life | 다이나믹듀오 | M | Seoul | 2012 | 48 | 2:51 |
| 아이티 구호현장에서 발견한 음식의 가치 | 김재학 | M | Seoul | 2010 | 8 | 0:25 |
| A spirit of sharing information and culture 'CC' | 최진권 | M | Daejeon | 2010 | 18 | 1:42 |
| Gibbons, long-armed apes | 김산하 | M | Seoul | 2010 | 73 | 2:22 |
| Never let go of your passion, just keep working on it | 김대식 | M | Daejeon | 2010 | 23 | 1:50 |
| Inconvenient truth of Korean Web | 김기창 | M | Busan | 2012 | 37 | 1:52 |
| Statecraft, the art of conducting public affairs | 윤여준 | M | Seoul | 2010 | 46 | 1:59 |
| Korean traditional hawk hunting | 박용순 | M | Daejeon | 2011 | 21 | 1:09 |
| Multiple identity diaspora | 김경묵 | M | Seoul | 2010 | 1 | 0:12 |
The corpus files can be downloaded individually or as a whole from the GitHub repository. Alternatively, the entire corpus is available as a single archive file at the following link: https://storage.googleapis.com/pansori/corpus/pansori-tedxkr-corpus-1.0.tar.gz [170MB].
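After downloading, it can be useful to confirm that an audio file matches the format stated for this corpus (16-bit FLAC at 16 kHz). The following is a minimal sketch that reads the STREAMINFO metadata block at the start of a FLAC stream using only the Python standard library. The function names are hypothetical, the sketch assumes the file starts directly with the `fLaC` marker (no prepended tags), and the channel count is not checked because the corpus description does not state it.

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse sample rate, channels and bit depth from the start of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    if len(data) < 8 + 34:
        raise ValueError("truncated STREAMINFO block")
    block_type = data[4] & 0x7F                   # low 7 bits: block type
    block_len = int.from_bytes(data[5:8], "big")  # 24-bit block length
    if block_type != 0 or block_len < 34:
        raise ValueError("first metadata block is not STREAMINFO")
    info = data[8:8 + 34]
    # Bytes 10..17 pack four fields into 64 bits:
    # 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(info[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def has_corpus_format(data: bytes) -> bool:
    """Check the header against the stated corpus format: 16-bit, 16 kHz."""
    info = read_flac_streaminfo(data)
    return info["sample_rate"] == 16000 and info["bits_per_sample"] == 16
```

For a file on disk, the first 42 bytes are enough, for example `has_corpus_format(open(path, "rb").read(42))`, where `path` points to one of the extracted FLAC files.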
We are currently preparing a large Korean-language ASR corpus by further automating the data processing pipeline used to generate this TEDxKR corpus. The new corpus will also be released under a permissive license once we confirm the license terms with the license holder.