- Chinese-asr-kaldi-and-other
- Features to test
- Reference Blogs
- About datasets
- About detecting language
- About Lexicon and text segmentation
- About Traditional2Simplified
- Steps of Kaldi
Plan: first build a Chinese model with Kaldi from Common Voice, then use Keras to build an end-to-end model. This repo will keep updating.
- Efficient Attention: Attention with Linear Complexities
- When Does Label Smoothing Help?
- Label Smoothing & Deep Learning: Google Brain explains why it works and when to use (SOTA tips)
- Some utterances are not correct. I found them while preprocessing the data, so I deleted them from both `invalidated.tsv` and `train.csv`. Here is the list:
  - common_voice_zh-CN_18682400.mp3: 包应登,字稺升,浙江杭州府钱塘县人,□籍,明朝政治人物、进士出身。 (the transcript contains the unreadable placeholder □)
- There is Japanese in the dataset. I deleted those utterances because I can't process them without entries in lexicon.txt for Kaldi; for an end-to-end ASR model it would not matter. I also don't think deleting them will affect model performance, since only a few samples have Japanese mixed in. I will not list these utterances here.
- Some mispronunciations exist for Greek letters. For example, ε is "epsilon", but in one utterance the speaker says "ei", so that utterance was deleted. Because of probable mistakes like this, I decided to delete all utterances containing Greek letters, since I don't have time to check each one.
- There are both Chinese (full-width) and English (half-width) punctuation marks, so we need to check and normalize them.
- Another problem occurred while preprocessing the data, with the Chinese word "別怕". At first I didn't know the difference between "別怕" and "别怕"; you can see that the "别" part differs. I listened to the wav: it is "bie2 pa4", so I changed "別怕" to "别怕". The corresponding audio file is common_voice_zh-CN_18597886.wav, and the text is 别怕就只是个超人。Aha, "別" is just the traditional form of "别".
- Some characters do not have an independent pronunciation in the lexicon even though they do have a pronunciation inside words, which is a little strange. So I wrote a program to extract every single character's pronunciation from the word entries (see the sketch after this list). Example: "忒" can be recovered from "不忒 BU_2 TE_4".
- Another problem is that there is no corresponding pronunciation for some traditional Chinese characters, so I map them to their simplified forms:
- "妳" ---> "你"
- "寀" ---> "采"
- "別" ---> "别"
From the blog 用Python进行语言检测 ("Language detection with Python"), we know that:

| Language | Regex |
| --- | --- |
| English | `u"[a-zA-Z]"` |
| Chinese | `u"[\u4e00-\u9fa5]+"` |
| Korean | `u"[\uac00-\ud7ff]+"` |
| Japanese | `u"[\u30a0-\u30ff\u3040-\u309f]+"` |
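These patterns can be used to flag utterances with mixed scripts, e.g. the Japanese samples mentioned above. A small sketch (the function name and the any-match logic are my own; note the Japanese range covers kana only, so kanji still match the Chinese range):

```python
import re

# Character-class patterns from the table above.
PATTERNS = {
    "english": re.compile(u"[a-zA-Z]"),
    "chinese": re.compile(u"[\u4e00-\u9fa5]+"),
    "korean": re.compile(u"[\uac00-\ud7ff]+"),
    "japanese": re.compile(u"[\u30a0-\u30ff\u3040-\u309f]+"),  # kana only
}

def detect_scripts(text):
    """Return the set of scripts that occur anywhere in `text`."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

print(detect_scripts("别怕就只是个超人"))  # {'chinese'}
print(detect_scripts("すごい别怕"))        # {'japanese', 'chinese'}
```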
- In this repository, we use BigCidian to get the lexicon.
- We use two kinds of tools for text segmentation:
| Tool | Included | Not included | Total |
| --- | --- | --- | --- |
| jieba | 15131 | 4908 | 20039 |
| thulac | 12124 | 6276 | 18400 |
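Assuming "included" means word types covered by the lexicon, numbers like these could be computed as follows (a sketch; jieba shown, thulac is analogous, and the lexicon file name is the one used later in this README):

```python
import jieba  # pip install jieba

def lexicon_coverage(transcripts, lexicon_words):
    """Count segmented word types that are / are not in the lexicon."""
    seen = set()
    for text in transcripts:
        seen.update(jieba.cut(text))
    included = sum(1 for word in seen if word in lexicon_words)
    return included, len(seen) - included, len(seen)

# Lexicon words are assumed to be the first column of word_to_pinyin.txt.
with open("word_to_pinyin.txt", encoding="utf-8") as f:
    lexicon = {line.split()[0] for line in f if line.strip()}

transcripts = ["别怕就只是个超人"]  # in practice, all training transcripts
print(lexicon_coverage(transcripts, lexicon))
```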
There are some libraries for this kind of traditional-to-simplified conversion, but they have a problem: for example, they cannot convert "寀" to "采". My guess is that this character is simply missing from their mapping tables. So I use langconv.py and zh_wiki.py instead; we can add customized characters to zh_wiki.py, so this approach seems better.
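A minimal usage sketch, assuming the widely circulated langconv.py / zh_wiki.py pair in which zh_wiki.py defines the zh2Hans mapping dict and langconv builds its tables from it at import time (verify the names against your copy):

```python
# Patch zh_wiki BEFORE importing langconv, since langconv reads the
# mapping tables from zh_wiki when it is imported.
import zh_wiki
zh_wiki.zh2Hans["寀"] = "采"  # customized character missing from stock tables

from langconv import Converter

def to_simplified(text):
    return Converter("zh-hans").convert(text)

print(to_simplified("別怕"))  # -> 别怕
```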
- Prepare dataset, lexicon and vocab.
  - Download BigCidian.
  - Fix some tiny problems with `word_to_pinyin.txt`; run: `python tools/fix_lexicon.py`
  - Generate the train, dev and test datasets: `python data/data.py`
  - Prepare lexicon.txt and vocab.txt: `python kaldi-script/local/ceate_lex_and_vocab.py`
- download the