Rutgers ECE Capstone (2019): Multilingual ASR Data Collection
Crawl multilingual audio and text resources from the web, then run forced alignment on the collected data.
The project has two parts: the Crawler and the Aligner.
In the Crawler part, we crawl two websites for multilingual audio and the corresponding text.
WordProject is a website that provides multilingual versions of the Bible; it supports 37 languages. The audio and text resources from this website match perfectly.
SBS News is a news website that publishes news in over 60 languages.
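As an illustration of the link-extraction step a crawler like this needs, here is a minimal sketch using only the Python standard library. It is not the project's actual crawler: the class name `AudioLinkParser`, the helper `extract_audio_links`, the list of audio extensions, and the assumption that audio files are referenced directly through `href` or `src` attributes are all hypothetical and would need to be adapted to each site's real page structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AudioLinkParser(HTMLParser):
    """Collect href/src attribute values that look like audio files.

    Hypothetical helper: the real pages may expose audio differently.
    """

    # Assumed set of audio extensions; adjust for the target site.
    AUDIO_EXTENSIONS = (".mp3", ".wav", ".ogg")

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            # Case-insensitive extension check; URLs with query strings
            # after the extension are not matched by this simple test.
            if value.lower().endswith(self.AUDIO_EXTENSIONS):
                absolute = urljoin(self.base_url, value)
                if absolute not in self.links:  # keep first occurrence only
                    self.links.append(absolute)


def extract_audio_links(html, base_url):
    """Return absolute audio URLs found in an HTML string, in page order."""
    parser = AudioLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The function works on an HTML string that has already been downloaded, so fetching, rate limiting, and saving the matching text are left to the surrounding crawler.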
In the Aligner part, we run forced alignment on the crawled data using Montreal-Forced-Aligner and Kaldi.
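Montreal-Forced-Aligner expects each audio file in the corpus to have a transcript file with the same base name, so it can help to check the crawled data before aligning. The sketch below is one possible pre-check, not part of the project's code: the function name `pair_corpus_files`, the flat directory layout, and the default extensions (`.wav` audio, `.lab` or `.txt` transcripts) are assumptions.

```python
from pathlib import Path


def pair_corpus_files(corpus_dir, audio_ext=".wav", text_exts=(".lab", ".txt")):
    """Pair each audio file with a same-named transcript.

    Returns (pairs, unmatched): pairs is a list of (audio, transcript)
    paths, unmatched is a list of audio paths with no transcript.
    Assumes a flat directory; subdirectories are not searched.
    """
    corpus = Path(corpus_dir)
    pairs, unmatched = [], []
    for audio in sorted(corpus.glob(f"*{audio_ext}")):
        # Take the first transcript extension that exists for this audio file.
        transcript = next(
            (
                audio.with_suffix(ext)
                for ext in text_exts
                if audio.with_suffix(ext).exists()
            ),
            None,
        )
        if transcript is None:
            unmatched.append(audio)
        else:
            pairs.append((audio, transcript))
    return pairs, unmatched
```

Audio files reported in `unmatched` can then be re-crawled or dropped before the aligner is run.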
The output is a set of TextGrid files.
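To show how the alignment output can be consumed, here is a minimal sketch that reads the labeled intervals out of a TextGrid file in Praat's long text format. It is an illustration under stated assumptions, not the project's tooling: the function name `read_textgrid_intervals` is hypothetical, each field is assumed to sit on its own line, and labels that span several lines are not handled.

```python
import re

NAME_RE = re.compile(r'name = "(.*)"$')
TIME_RE = re.compile(r'(xmin|xmax) = ([\d.eE+-]+)$')
TEXT_RE = re.compile(r'text = "(.*)"$')


def read_textgrid_intervals(textgrid_text):
    """Return {tier_name: [(start, end, label), ...]} from TextGrid text.

    Assumes the long text format with one field per line.
    """
    tiers = {}
    current = None       # interval list of the tier being read
    xmin = xmax = None   # most recently seen boundaries
    for raw_line in textgrid_text.splitlines():
        line = raw_line.strip()
        match = NAME_RE.match(line)
        if match:
            # A name field starts a new tier.
            current = tiers.setdefault(match.group(1), [])
            continue
        match = TIME_RE.match(line)
        if match:
            if match.group(1) == "xmin":
                xmin = float(match.group(2))
            else:
                xmax = float(match.group(2))
            continue
        match = TEXT_RE.match(line)
        if match and current is not None:
            # A text field closes an interval; doubled quotes are escapes.
            label = match.group(1).replace('""', '"')
            current.append((xmin, xmax, label))
    return tiers
```

Calling `read_textgrid_intervals(open(path, encoding="utf-8").read())` on an aligner output file should give one list of `(start, end, label)` tuples per tier, with times in seconds.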
TextGrid demo:
Mo Shi, Chaoji Zuo, Ziqi Wang, Zekun Zhang, Duc Le