k2-fsa/icefall

Help with training/finetuning a zipformer based model

daocunyang opened this issue · 6 comments

Hi guys, I'm a newbie trying to finetune a zipformer-based ASR model, sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 (bilingual Chinese/English), using icefall, to improve its performance on my custom data. So far I have collected hours of 8 kHz audio for training. Each audio file is about 1 minute long and contains one person speaking Mandarin Chinese (not continuously; each file has occasional silences of several seconds).

I've been trying to follow the yesno example to prepare my own dataset, so that I can train my model with it, as suggested here, but I'm unclear about the following:

  1. How should I organize my dataset locally? Is there any convention for the folder structure and for naming the folders and audio files that I should follow?
  2. Do I need to create both yesno.py and prepare.sh for my own dataset? What's the relationship between these two files?
  3. In prepare.sh, I saw that it uses lhotse prepare yesno. Do I need to implement a prepare command for my own Chinese data, to support something like 'lhotse prepare my_own_data'? If yes, can someone explain how to do it? (I've sketched what I imagine right after this list.)
  4. Also in prepare.sh, I noticed the part where it creates lexicon.txt. What should this part look like for my Chinese audio data? Should I list all the Chinese words that appear in my audio here?
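
To make question 3 more concrete, here is roughly what I imagine a minimal lhotse-based preparation script could look like. This is only a sketch based on my reading of the lhotse docs; all file names and paths are made up, and it assumes one transcript file with lines of the form "utt_id transcript" plus one wav file per utterance:

```python
# prepare_my_own_data.py -- a rough sketch of a custom lhotse preparation step
# (hypothetical paths and names), building RecordingSet/SupervisionSet manifests
# from a folder of wav files and a single transcript file.
from pathlib import Path

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

corpus_dir = Path("data/my_own_data")        # contains data_001.wav, data_002.wav, ...
transcript_file = corpus_dir / "text"        # lines like: "data_001 你好我现在不太方便"
output_dir = Path("data/manifests")
output_dir.mkdir(parents=True, exist_ok=True)

# Read the transcripts into a dict: utt_id -> text.
texts = {}
for line in transcript_file.read_text(encoding="utf-8").splitlines():
    if not line.strip():
        continue
    utt_id, *rest = line.strip().split(maxsplit=1)
    texts[utt_id] = rest[0] if rest else ""

recordings = []
supervisions = []
for utt_id, text in texts.items():
    recording = Recording.from_file(corpus_dir / f"{utt_id}.wav", recording_id=utt_id)
    recordings.append(recording)
    # One supervision spanning the whole file, since each file has a single speaker.
    supervisions.append(
        SupervisionSegment(
            id=utt_id,
            recording_id=utt_id,
            start=0.0,
            duration=recording.duration,
            channel=0,
            language="Chinese",
            text=text,
        )
    )

RecordingSet.from_recordings(recordings).to_file(output_dir / "my_recordings.jsonl.gz")
SupervisionSet.from_segments(supervisions).to_file(output_dir / "my_supervisions.jsonl.gz")
```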

Any help is appreciated. It would be really helpful if there were a step-by-step guide on how to train with a custom dataset. Thanks in advance.

@JinZr Thanks so much for the prompt answer, really appreciate it. Could you kindly help with some follow-up questions (pardon my lack of knowledge of Kaldi and ASR in general):

  1. Just to confirm, for the folder structure, do you mean the following?
data_dir/
├─ train/
│  ├─ text
│  ├─ wav.scp
│  ├─ utt2spk
│  ├─ spk2utt
├─ test/
│  ├─ text
│  ├─ wav.scp
│  ├─ utt2spk
│  ├─ spk2utt
  2. Since I don't care about the speaker info in my training, can both utt2spk and spk2utt just contain utt_id utt_id (i.e., each of the two files contains a single line, utt_id utt_id)? (See the small sketch after the examples below.)

  3. For wav.scp and text, do the contents below look right to you?

wav.scp contains the following: 

data_001 /absolute_file_path_to_001.wav
data_002 /absolute_file_path_to_002.wav
...

text contains the following:

data_001 (Chinese text with the words separated by spaces, i.e., word-segmented data)
...
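
To double-check that I understand the convention, here is how I would generate utt2spk and spk2utt from wav.scp when the speaker identity doesn't matter. It's only a sketch (the script name is made up), and I'm assuming the usual Kaldi rule of one line per utterance rather than a single line per file:

```python
# make_spk_maps.py -- derive utt2spk and spk2utt from an existing wav.scp,
# mapping every utterance to itself as its own "speaker".
from pathlib import Path

data_dir = Path("data_dir/train")  # the directory holding wav.scp and text

utt_ids = [
    line.split(maxsplit=1)[0]
    for line in (data_dir / "wav.scp").read_text().splitlines()
    if line.strip()
]

with open(data_dir / "utt2spk", "w") as u2s, open(data_dir / "spk2utt", "w") as s2u:
    for utt_id in sorted(utt_ids):
        u2s.write(f"{utt_id} {utt_id}\n")  # one line per utterance: <utt_id> <spk_id>
        s2u.write(f"{utt_id} {utt_id}\n")  # with spk_id == utt_id the two files are identical
```

If I'm reading the lhotse CLI docs correctly, such a directory can then be converted to lhotse manifests with something like 'lhotse kaldi import data_dir/train 8000 data/manifests/train', but please correct me if that's not the intended path.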

In addition, may I ask some general questions about ASR training:

  1. I wonder if you have any advice on how to prepare the contents of text most efficiently? It involves first running ASR on each audio file, but given that the current ASR model doesn't perform well on these data, the output may need quite a bit of human proofreading. Also, how should we perform word segmentation? Should we use jieba, or do you have a better recommendation?

  2. For the custom audio data I plan to use for training, at the beginning of most files there are about 10 to 30 seconds of silence or ringback-tone music (彩铃) before the person starts talking. Will such music or silence negatively impact the training results? Do you think it's necessary to remove these parts before training?
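
In case it matters, this is the kind of trimming I would try first. It's only a sketch using pydub (the file name and thresholds are made up, and pydub needs ffmpeg installed); it only handles plain silence, not the ringback-tone music, which I guess would need a VAD or music detector instead:

```python
# trim_leading_silence.py -- strip the long silent lead-in before the speech starts.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_wav("data_001.wav")

# Find non-silent regions; the thresholds are guesses and would need tuning
# for 8 kHz phone audio.
nonsilent = detect_nonsilent(
    audio, min_silence_len=500, silence_thresh=audio.dBFS - 16
)
if nonsilent:
    start_ms = nonsilent[0][0]  # start of the first non-silent region
    audio[start_ms:].export("data_001_trimmed.wav", format="wav")
```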

Thanks again for your help.

@JinZr Thanks a lot! Your answers really help.

1. Regarding question 4: got it, I originally thought word segmentation was needed here. For the text part, should the transcriptions contain spaces and punctuation? (I've put a small normalization sketch after question 2 below.) For example, should I prepare the data in text to look like the following:

data_001 你好。对的。哦,我现在不太方便。好的我知道了。
or should I prepare it as follows:
data_001 你好对的哦我现在不太方便好的我知道了 ?

2. Regarding the transcriptions for training, another thing I'd like your thoughts on: in our phone-conversation scenario, some users tend to think as they speak and, as a result, draw out certain words and characters. For example, instead of saying "这个没问题" at a regular speed, a user may say "这个——没问题", where it takes a second or more to finish saying "个" (the sound is continuous; I call this 拖音, prolonging the pronunciation of a sound). Do you think it's possible to train the ASR model to output a special character like "——" so that it can detect this prolongation? I'm thinking of adding a special character to the transcriptions in text to capture it, but I'm not sure if it's a good idea. Any thoughts?
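
On question 1 above: in case the answer is that punctuation should be removed (which I believe is what the existing Chinese recipes expect), this is the kind of normalization pass I have in mind. It's just a sketch, and the punctuation list would need to be extended for real data:

```python
# normalize_text.py -- strip punctuation and whitespace from the transcripts,
# keeping only the spoken characters (sketch; extend the punctuation set as needed).
import re

PUNCT = re.compile(r"[，。、；：？！（）“”‘’,.!?;:()\s]+")

def normalize(text: str) -> str:
    return PUNCT.sub("", text)

print(normalize("你好。对的。哦，我现在不太方便。好的我知道了。"))
# -> 你好对的哦我现在不太方便好的我知道了
```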

Thanks again.

@JinZr Thanks. Two more questions:
1. We currently have only a few hundred audio files for training (not very many). How do you suggest we divide the data into training and test sets? I'm thinking of using most or even all of them for training, and few or none for the test set. (A rough splitting sketch is at the end of this post.)

2. For ASR training, what's the ideal length of each audio clip? Is 20 seconds OK?
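
For what it's worth, here is how I'd try to split the data once the lhotse manifests from the earlier step exist. It's a sketch with made-up manifest names and an arbitrary held-out size:

```python
# split_train_dev.py -- carve out a small dev/test set from the full cut set
# (sketch; manifest paths and the held-out size are placeholders).
import random

from lhotse import CutSet, load_manifest

recordings = load_manifest("data/manifests/my_recordings.jsonl.gz")
supervisions = load_manifest("data/manifests/my_supervisions.jsonl.gz")
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# Randomly hold out a small number of recordings as a dev/test set.
random.seed(42)
cut_ids = list(cuts.ids)
random.shuffle(cut_ids)
dev_ids = set(cut_ids[:20])  # e.g. ~20 files held out

dev_cuts = cuts.filter(lambda c: c.id in dev_ids)
train_cuts = cuts.filter(lambda c: c.id not in dev_ids)

dev_cuts.to_file("data/manifests/my_cuts_dev.jsonl.gz")
train_cuts.to_file("data/manifests/my_cuts_train.jsonl.gz")
```

Also, as far as I can tell, many icefall training scripts filter cuts to roughly 1 to 20 seconds by default, so I'm hoping 20-second clips are fine, but please correct me if I'm wrong.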