k2-fsa/icefall

Help with training/finetuning a zipformer based model

daocunyang opened this issue · 6 comments

Hi guys, I'm a newbie trying to finetune a zipformer-based ASR model, sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 (bilingual Chinese/English), using icefall, to improve its performance on my custom data. So far I have collected hours of 8 kHz audio for training. Each audio file is about 1 minute long and contains one person speaking Mandarin Chinese (not continuously; each file has occasional silences of several seconds).

I've been trying to follow the yesno example to prepare my own dataset, so that I can train my model with it, as suggested here, but I'm unclear about the following:

  1. How should I organize my dataset locally? Is there any convention for the folder structure and for naming the folders and audio files that I should follow?
  2. Do I need to create both yesno.py and prepare.sh for my own dataset? What's the relationship between these two files?
  3. In prepare.sh, I saw that it uses lhotse prepare yesno. Do I need to implement a prepare command for my own Chinese data, to support something like 'lhotse prepare my_own_data'? If yes, can someone explain how to do it? (I've sketched what I imagine right after this list.)
  4. Also in prepare.sh, I noticed the part where it creates lexicon.txt. What should this part look like for my Chinese audio data? Should I list all the Chinese words that appear in my audio here?
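
To make question 3 more concrete, here is roughly what I imagine a minimal lhotse-based preparation script could look like. This is only a sketch based on my reading of the lhotse docs; all file names and paths are made up, and it assumes one transcript file with lines of the form "utt_id transcript" plus one wav file per utterance:

```python
# prepare_my_own_data.py -- a rough sketch of a custom lhotse preparation step
# (hypothetical paths and names), building RecordingSet/SupervisionSet manifests
# from a folder of wav files and a single transcript file.
from pathlib import Path

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

corpus_dir = Path("data/my_own_data")        # contains data_001.wav, data_002.wav, ...
transcript_file = corpus_dir / "text"        # lines like: "data_001 你好我现在不太方便"
output_dir = Path("data/manifests")
output_dir.mkdir(parents=True, exist_ok=True)

# Read the transcripts into a dict: utt_id -> text.
texts = {}
for line in transcript_file.read_text(encoding="utf-8").splitlines():
    if not line.strip():
        continue
    utt_id, *rest = line.strip().split(maxsplit=1)
    texts[utt_id] = rest[0] if rest else ""

recordings = []
supervisions = []
for utt_id, text in texts.items():
    recording = Recording.from_file(corpus_dir / f"{utt_id}.wav", recording_id=utt_id)
    recordings.append(recording)
    # One supervision spanning the whole file, since each file has a single speaker.
    supervisions.append(
        SupervisionSegment(
            id=utt_id,
            recording_id=utt_id,
            start=0.0,
            duration=recording.duration,
            channel=0,
            language="Chinese",
            text=text,
        )
    )

RecordingSet.from_recordings(recordings).to_file(output_dir / "my_recordings.jsonl.gz")
SupervisionSet.from_segments(supervisions).to_file(output_dir / "my_supervisions.jsonl.gz")
```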

Any help is appreciated. It would be really helpful if there were a step-by-step guide on how to train with a custom dataset. Thanks in advance.

@JinZr Thanks so much for the prompt answer, really appreciate it. Could you kindly help with some follow-up questions (pardon my lack of knowledge of Kaldi and ASR in general):

  1. Just to confirm, for the folder structure, do you mean the following?
data_dir/
├─ train/
│  ├─ text
│  ├─ wav.scp
│  ├─ utt2spk
│  ├─ spk2utt
├─ test/
│  ├─ text
│  ├─ wav.scp
│  ├─ utt2spk
│  ├─ spk2utt
  2. Since I don't care about the speaker info in my training, can both utt2spk and spk2utt just contain utt_id utt_id (i.e., each of the two files contains a single line, utt_id utt_id)? (See the small sketch after the examples below.)

  3. For wav.scp and text, do the contents below look right to you?

wav.scp contains the following: 

data_001 /absolute_file_path_to_001.wav
data_002 /absolute_file_path_to_002.wav
...

text contains the following:

data_001 (Chinese text with the words separated by spaces, i.e., word-segmented data)
...
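
To double-check that I understand the convention, here is how I would generate utt2spk and spk2utt from wav.scp when the speaker identity doesn't matter. It's only a sketch (the script name is made up), and I'm assuming the usual Kaldi rule of one line per utterance rather than a single line per file:

```python
# make_spk_maps.py -- derive utt2spk and spk2utt from an existing wav.scp,
# mapping every utterance to itself as its own "speaker".
from pathlib import Path

data_dir = Path("data_dir/train")  # the directory holding wav.scp and text

utt_ids = [
    line.split(maxsplit=1)[0]
    for line in (data_dir / "wav.scp").read_text().splitlines()
    if line.strip()
]

with open(data_dir / "utt2spk", "w") as u2s, open(data_dir / "spk2utt", "w") as s2u:
    for utt_id in sorted(utt_ids):
        u2s.write(f"{utt_id} {utt_id}\n")  # one line per utterance: <utt_id> <spk_id>
        s2u.write(f"{utt_id} {utt_id}\n")  # with spk_id == utt_id the two files are identical
```

If I'm reading the lhotse CLI docs correctly, such a directory can then be converted to lhotse manifests with something like 'lhotse kaldi import data_dir/train 8000 data/manifests/train', but please correct me if that's not the intended path.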

In addition, may I ask some general questions about ASR training:

  1. I wonder if you have any advice on how to prepare the contents of text most efficiently? It involves first running ASR on each audio file, but given that the current ASR model doesn't perform well on these data, the output may need quite a bit of human proofreading. Also, how should we perform word segmentation? Should we use jieba, or do you have a better recommendation?

  2. For the custom audio data I plan to use for training, at the beginning of most files there are about 10 to 30 seconds of silence or ringback-tone music (彩铃) before the person starts talking. Will such music or silence negatively impact the training results? Do you think it's necessary to remove these parts before training?
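
In case it matters, this is the kind of trimming I would try first. It's only a sketch using pydub (the file name and thresholds are made up, and pydub needs ffmpeg installed); it only handles plain silence, not the ringback-tone music, which I guess would need a VAD or music detector instead:

```python
# trim_leading_silence.py -- strip the long silent lead-in before the speech starts.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_wav("data_001.wav")

# Find non-silent regions; the thresholds are guesses and would need tuning
# for 8 kHz phone audio.
nonsilent = detect_nonsilent(
    audio, min_silence_len=500, silence_thresh=audio.dBFS - 16
)
if nonsilent:
    start_ms = nonsilent[0][0]  # start of the first non-silent region
    audio[start_ms:].export("data_001_trimmed.wav", format="wav")
```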

Thanks again for your help.

@JinZr Thanks a lot! Your answers really help.

1. Regarding question 4: got it, I originally thought word segmentation was needed here. For the text part, should the transcriptions contain spaces and punctuation? (I've put a small normalization sketch after question 2 below.) For example, should I prepare the data in text to look like the following:

data_001 你好。对的。哦,我现在不太方便。好的我知道了。
or should I prepare it as follows:
data_001 你好对的哦我现在不太方便好的我知道了 ?

2. Regarding the transcriptions for training, another thing I'd like your thoughts on: in our phone-conversation scenario, some users tend to think as they speak and, as a result, draw out certain words and characters. For example, instead of saying "这个没问题" at a regular speed, a user may say "这个——没问题", where it takes a second or more to finish saying "个" (the sound is continuous; I call this 拖音, prolonging the pronunciation of a sound). Do you think it's possible to train the ASR model to output a special character like "——" so that it can detect this prolongation? I'm thinking of adding a special character to the transcriptions in text to capture it, but I'm not sure if it's a good idea. Any thoughts?
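
On question 1 above: in case the answer is that punctuation should be removed (which I believe is what the existing Chinese recipes expect), this is the kind of normalization pass I have in mind. It's just a sketch, and the punctuation list would need to be extended for real data:

```python
# normalize_text.py -- strip punctuation and whitespace from the transcripts,
# keeping only the spoken characters (sketch; extend the punctuation set as needed).
import re

PUNCT = re.compile(r"[，。、；：？！（）“”‘’,.!?;:()\s]+")

def normalize(text: str) -> str:
    return PUNCT.sub("", text)

print(normalize("你好。对的。哦，我现在不太方便。好的我知道了。"))
# -> 你好对的哦我现在不太方便好的我知道了
```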

Thanks again.

@JinZr Thanks. Two more questions:
1. We currently have only a few hundred audio files for training (not very many). How do you suggest we divide the data into training and test sets? I'm thinking of using most or even all of them for training, and few or none for the test set. (A rough splitting sketch is at the end of this post.)

2. For ASR training, what's the ideal length of each audio clip? Is 20 seconds OK?
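
For what it's worth, here is how I'd try to split the data once the lhotse manifests from the earlier step exist. It's a sketch with made-up manifest names and an arbitrary held-out size:

```python
# split_train_dev.py -- carve out a small dev/test set from the full cut set
# (sketch; manifest paths and the held-out size are placeholders).
import random

from lhotse import CutSet, load_manifest

recordings = load_manifest("data/manifests/my_recordings.jsonl.gz")
supervisions = load_manifest("data/manifests/my_supervisions.jsonl.gz")
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# Randomly hold out a small number of recordings as a dev/test set.
random.seed(42)
cut_ids = list(cuts.ids)
random.shuffle(cut_ids)
dev_ids = set(cut_ids[:20])  # e.g. ~20 files held out

dev_cuts = cuts.filter(lambda c: c.id in dev_ids)
train_cuts = cuts.filter(lambda c: c.id not in dev_ids)

dev_cuts.to_file("data/manifests/my_cuts_dev.jsonl.gz")
train_cuts.to_file("data/manifests/my_cuts_train.jsonl.gz")
```

Also, as far as I can tell, many icefall training scripts filter cuts to roughly 1 to 20 seconds by default, so I'm hoping 20-second clips are fine, but please correct me if I'm wrong.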