- Chinese-asr-kaldi-and-other
- Features to test
- Reference Blogs
- About datasets
- About detecting language
- About Lexicon and text segmentation
- About Traditional2Simplified
- Steps of Kaldi
Plan: first build a Chinese model with Kaldi from Common Voice, then use Keras to build an end-to-end model. This repo will keep updating.
- Efficient Attention: Attention with Linear Complexities
- When Does Label Smoothing Help?
- Label Smoothing & Deep Learning: Google Brain explains why it works and when to use (SOTA tips)
- Some utterances are not correct. I found them while preprocessing the data, so I deleted them from both `invalidated.tsv` and `train.csv`. Here is the list:
  - common_voice_zh-CN_18682400.mp3: 包应登,字稺升,浙江杭州府钱塘县人,□籍,明朝政治人物、进士出身。 (the transcript contains the unreadable placeholder □)
- There is Japanese in the dataset. I deleted those utterances because I can't process them without entries in lexicon.txt for Kaldi; for an end-to-end ASR model it would not matter. I also don't think deleting them will affect model performance, since only a few samples have Japanese mixed in. I will not list these utterances here.
- Some mispronunciations exist for Greek letters. For example, ε is "epsilon", but in one utterance the speaker says "ei", so that utterance was deleted. Because of probable mistakes like this, I decided to delete all utterances containing Greek letters, since I don't have time to check each one.
- There are both Chinese (full-width) and English (half-width) punctuation marks, so we need to check and normalize them.
- Another problem occurred while preprocessing the data, with the Chinese word "別怕". At first I didn't know the difference between "別怕" and "别怕"; you can see that the "别" part differs. I listened to the wav: it is "bie2 pa4", so I changed "別怕" to "别怕". The corresponding audio file is common_voice_zh-CN_18597886.wav, and the text is 别怕就只是个超人。Aha, "別" is just the traditional form of "别".
- Some characters do not have an independent pronunciation in the lexicon even though they do have a pronunciation inside words, which is a little strange. So I wrote a program to extract every single character's pronunciation from the word entries (see the sketch after this list). Example: "忒" can be recovered from "不忒 BU_2 TE_4".
- Another problem is that there is no corresponding pronunciation for some traditional Chinese characters, so I map them to their simplified forms:
- "妳" ---> "你"
- "寀" ---> "采"
- "別" ---> "别"
From the blog 用Python进行语言检测 ("Language detection with Python"), we know that:

| Language | Regex |
| --- | --- |
| English | `u"[a-zA-Z]"` |
| Chinese | `u"[\u4e00-\u9fa5]+"` |
| Korean | `u"[\uac00-\ud7ff]+"` |
| Japanese | `u"[\u30a0-\u30ff\u3040-\u309f]+"` |
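These patterns can be used to flag utterances with mixed scripts, e.g. the Japanese samples mentioned above. A small sketch (the function name and the any-match logic are my own; note the Japanese range covers kana only, so kanji still match the Chinese range):

```python
import re

# Character-class patterns from the table above.
PATTERNS = {
    "english": re.compile(u"[a-zA-Z]"),
    "chinese": re.compile(u"[\u4e00-\u9fa5]+"),
    "korean": re.compile(u"[\uac00-\ud7ff]+"),
    "japanese": re.compile(u"[\u30a0-\u30ff\u3040-\u309f]+"),  # kana only
}

def detect_scripts(text):
    """Return the set of scripts that occur anywhere in `text`."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

print(detect_scripts("别怕就只是个超人"))  # {'chinese'}
print(detect_scripts("すごい别怕"))        # {'japanese', 'chinese'}
```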
- In this repository, we use BigCidian to get the lexicon.
- We use two kinds of tools for text segmentation:
| Tool | Included | Not included | Total |
| --- | --- | --- | --- |
| jieba | 15131 | 4908 | 20039 |
| thulac | 12124 | 6276 | 18400 |
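Assuming "included" means word types covered by the lexicon, numbers like these could be computed as follows (a sketch; jieba shown, thulac is analogous, and the lexicon file name is the one used later in this README):

```python
import jieba  # pip install jieba

def lexicon_coverage(transcripts, lexicon_words):
    """Count segmented word types that are / are not in the lexicon."""
    seen = set()
    for text in transcripts:
        seen.update(jieba.cut(text))
    included = sum(1 for word in seen if word in lexicon_words)
    return included, len(seen) - included, len(seen)

# Lexicon words are assumed to be the first column of word_to_pinyin.txt.
with open("word_to_pinyin.txt", encoding="utf-8") as f:
    lexicon = {line.split()[0] for line in f if line.strip()}

transcripts = ["别怕就只是个超人"]  # in practice, all training transcripts
print(lexicon_coverage(transcripts, lexicon))
```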
There are some libraries for this kind of traditional-to-simplified conversion, but they have a problem: for example, they cannot convert "寀" to "采". My guess is that this character is simply missing from their mapping tables. So I use langconv.py and zh_wiki.py instead; we can add customized characters to zh_wiki.py, so this approach seems better.
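A minimal usage sketch, assuming the widely circulated langconv.py / zh_wiki.py pair in which zh_wiki.py defines the zh2Hans mapping dict and langconv builds its tables from it at import time (verify the names against your copy):

```python
# Patch zh_wiki BEFORE importing langconv, since langconv reads the
# mapping tables from zh_wiki when it is imported.
import zh_wiki
zh_wiki.zh2Hans["寀"] = "采"  # customized character missing from stock tables

from langconv import Converter

def to_simplified(text):
    return Converter("zh-hans").convert(text)

print(to_simplified("別怕"))  # -> 别怕
```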
- Prepare dataset, lexicon and vocab.
  - Download BigCidian.
  - Fix some tiny problems with `word_to_pinyin.txt`; run: `python tools/fix_lexicon.py`
  - Generate the train, dev and test datasets: `python data/data.py`
  - Prepare lexicon.txt and vocab.txt: `python kaldi-script/local/ceate_lex_and_vocab.py`
- download the