

JTubeSpeech+: Corpus of Japanese speech collected from YouTube

This is a fork of https://github.com/sarulab-speech/jtubespeech

This repository provides 1) a list of YouTube videos with Japanese subtitles (JTubeSpeech), 2) scripts for building such lists for other languages, and 3) tiny lists for other languages.


Each file data/{lang}/{YYYYMM}.csv lists videos in the following format. See step4 for how to download them.

     videoid      auto   sub   channelid
0    0017RsBbUHk  True   True  UCTW2tw0Mhho72MojB1L48IQ
1    00PqfZgiboc  False  True  UCzoghTgl4dvIW9GZF6UC-BA
...  ...          ...    ...   ...

  • lang: Language ID (ja [Japanese], en [English], ...)
  • YYYYMM: Year and month when the data was collected
  • videoid: YouTube video ID. Its YouTube page is https://www.youtube.com/watch?v={videoid}.
  • auto: Whether the video has automatic subtitles.
  • sub: Whether the video has manual (i.e., human-generated) subtitles.
  • channelid: YouTube Channel ID. Its YouTube page is https://www.youtube.com/channel/{channelid}.
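As a minimal sketch of how such a list might be consumed, the snippet below reads rows with Python's csv module and builds the watch URL for every video that has manual subtitles. It assumes the file is comma-separated with a header row containing videoid, auto, sub, and channelid, and that the flags are stored as the strings True and False; check the actual files before relying on this. The helper name manual_subtitle_urls is hypothetical and not part of this repository.

```python
import csv
import io


def manual_subtitle_urls(csv_text):
    """Return watch URLs for rows whose `sub` column is the string "True"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        f"https://www.youtube.com/watch?v={row['videoid']}"
        for row in reader
        if row["sub"] == "True"
    ]


# Sample rows taken from the table above; the comma-separated layout is an assumption.
sample = (
    "videoid,auto,sub,channelid\n"
    "0017RsBbUHk,True,True,UCTW2tw0Mhho72MojB1L48IQ\n"
    "00PqfZgiboc,False,True,UCzoghTgl4dvIW9GZF6UC-BA\n"
)
for url in manual_subtitle_urls(sample):
    print(url)
```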


lang  filename (data/)    #videos-sub-true        #videos-auto-true
ja    ja/202103.csv       110,000 (10,000 hours)  4,960,000
en    en/202108_tiny.csv  74,227                  65,570
zh    zh/202108_tiny.csv  63,126                  23,387
th    th/202108_tiny.csv  40,886                  26,907
ru    ru/202108_tiny.csv  39,890                  46,061
hi    hi/202108_tiny.csv  34,034                  31,439
ar    ar/202108_tiny.csv  31,993                  42,649
de    de/202108_tiny.csv  30,727                  66,954
tr    tr/202108_tiny.csv  27,317                  68,079
el    el/202108_tiny.csv  25,947                  26,735
fr    fr/202108_tiny.csv  25,371                  70,466
ta    ta/202108_tiny.csv  21,860                  26,120
da    da/202108_tiny.csv  18,779                  62,094
id    id/202108_tiny.csv  18,086                  72,760
bn    bn/202108_tiny.csv  16,315                  57,112
fi    fi/202108_tiny.csv  15,561                  50,626
my    my/202108_tiny.csv  14,729                  95,755
hu    hu/202108_tiny.csv  13,154                  49,237
te    te/202108_tiny.csv  11,929                  24,444
pt    pt/202108_tiny.csv  11,692                  48,974
az    az/202108_tiny.csv  11,188                  52,025
ur    ur/202108_tiny.csv  10,917                  26,503
is    is/202108_tiny.csv  10,632                  38,268
fa    fa/202108_tiny.csv  10,482                  24,102
ka    ka/202108_tiny.csv  10,395                  23,914
uk    uk/202108_tiny.csv  9,103                   36,392
ml    ml/202108_tiny.csv  9,080                   42,359
ga    ga/202108_tiny.csv  9,058                   51,411
be    be/202108_tiny.csv  7,622                   37,739
ky    ky/202108_tiny.csv  7,241                   42,027
kk    kk/202108_tiny.csv  6,917                   26,163
tg    tg/202108_tiny.csv  5,491                   40,244
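Counts like those in the two rightmost columns could plausibly be recomputed from a list file by tallying the rows whose sub or auto flag is set. The sketch below does that under two assumptions: the file is comma-separated with a header row, and the flags are stored as the strings True and False. Whether the published figures were produced exactly this way is not stated here, and the function name count_subtitle_flags is hypothetical.

```python
import csv
import io


def count_subtitle_flags(csv_text):
    """Count rows where `sub` and `auto` equal the string "True"."""
    counts = {"sub": 0, "auto": 0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for key in counts:
            if row[key] == "True":
                counts[key] += 1
    return counts


# Two illustrative rows in the assumed comma-separated layout.
example = (
    "videoid,auto,sub,channelid\n"
    "0017RsBbUHk,True,True,UCTW2tw0Mhho72MojB1L48IQ\n"
    "00PqfZgiboc,False,True,UCzoghTgl4dvIW9GZF6UC-BA\n"
)
print(count_subtitle_flags(example))
```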



docker pull cadic/jtubespeechp
docker run --rm -v /FileStore:/Filestore -it cadic/jtubespeechp

Scripts for data collection

jtubespeechp is a package for collecting data from YouTube. Because the scripts are language independent, users can collect data for any language they choose. youtube-dl and ffmpeg are required.

step1: making search words

The module jtubespeechp/search downloads the Wikipedia dump file and extracts words to use as video search queries. {lang} is the language code, e.g., ja (Japanese) or en (English).

$ python -m jtubespeechp.search {lang}

step2: obtaining video IDs

The module jtubespeechp/video_id obtains YouTube video IDs by searching for each word. {filename_word_list} is the word list file made in step1. From this step onward, processing takes a long time. It is recommended to split the input files (e.g., {filename_word_list}) and process the parts in parallel.

$ python -m jtubespeechp.video_id {lang} {filename_word_list}
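One way to prepare the word list for parallel runs is to split it into several smaller files first. The sketch below assumes the word list is a UTF-8 text file with one search word per line; the function name split_word_list and the .partN naming scheme are hypothetical and not part of jtubespeechp.

```python
from pathlib import Path


def split_word_list(path, n_parts):
    """Distribute the non-empty lines of `path` round-robin over n_parts files."""
    src = Path(path)
    words = [w for w in src.read_text(encoding="utf-8").splitlines() if w.strip()]
    out_paths = []
    for i in range(n_parts):
        # e.g. words.txt -> words.part0.txt, words.part1.txt, ...
        part = src.with_name(f"{src.stem}.part{i}{src.suffix}")
        chunk = words[i::n_parts]
        part.write_text("".join(w + "\n" for w in chunk), encoding="utf-8")
        out_paths.append(part)
    return out_paths
```

Each resulting file can then be passed as {filename_word_list} to a separate step2 process. Merging the outputs of those runs afterwards is left to the user.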

step3: checking if subtitles are available

The module jtubespeechp/subtitles checks whether each video has subtitles. {filename_videoid_list} is the video ID list file made in step2. This step produces a CSV file.

$ python -m jtubespeechp.subtitles {lang} {filename_videoid_list}

step4: downloading videos with manual subtitles

The module jtubespeechp/download downloads audio and manual subtitles. Note that this process requires a very large amount of storage. {filename_subtitle_list} is the subtitle list file made in step3. The audio and subtitles will be saved in video/{lang}/wav16k and video/{lang}/txt, respectively.

$ python -m jtubespeechp.download {lang} {filename_subtitle_list}
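Before moving on to alignment, it can be useful to check which downloads produced both an audio file and a subtitle file. The sketch below pairs files from the two output directories by file stem. It assumes audio files end in .wav, subtitle files end in .txt, and matching files share the same stem; the exact layout inside wav16k and txt is not described here, so the search is recursive. The function name pair_audio_and_text is hypothetical.

```python
from pathlib import Path


def pair_audio_and_text(wav_dir, txt_dir):
    """Map file stem -> (wav path, txt path) for stems present in both trees."""
    wavs = {p.stem: p for p in Path(wav_dir).rglob("*.wav")}
    txts = {p.stem: p for p in Path(txt_dir).rglob("*.txt")}
    common = sorted(wavs.keys() & txts.keys())
    return {stem: (wavs[stem], txts[stem]) for stem in common}
```

Stems that appear in only one of the two directories are silently skipped, which makes incomplete downloads easy to spot by comparing counts.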

step5 (ASR): alignment and scoring

Subtitles are not always correctly aligned with the audio, and in some cases they do not match the audio at all. The script jtubespeechp/align aligns subtitles and audio with CTC segmentation using an ESPnet 2 ASR model:

$ python -m jtubespeechp.align {asr_train_config} {asr_model_file} {wavdir} {txtdir} {output_dir}

The results are written to a segments file, segments.txt, and a log file, segments.log, in the output directory. Using the segments file, bad utterances or audio files can be filtered out by thresholding the confidence score in the fifth column:

awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${output_dir}/segments.txt
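For use inside a Python pipeline, the same filter can be sketched as follows. It mirrors the awk one-liner above by keeping lines whose fifth whitespace-separated field is greater than the threshold. The function name filter_segments and the sample lines are made up for illustration, and the sketch assumes the fifth field always parses as a number.

```python
def filter_segments(lines, min_confidence_score):
    """Keep lines whose fifth whitespace-separated field exceeds the threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 5 and float(fields[4]) > min_confidence_score:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up segment lines; only the position of the score (field 5) matters here.
demo_lines = [
    "utt_0 rec 0.00 1.00 -0.50 hello",
    "utt_1 rec 1.00 2.00 -3.20 world",
]
for line in filter_segments(demo_lines, -1.0):
    print(line)
```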
