microsoft/SpeechT5

[SpeechLM] Questions about the phoneme tokenizer

yuseungwoo opened this issue · 1 comment

First of all, thanks for your great work and code.

I am studying SpeechLM and am curious about a few details of training and inference.

  1. Could you tell me which stage of the recipe you used to train the phoneme tokenizer? Is it the one below #L155, as I expect?
    [https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh#L155]

  2. Could you tell me which decoder is used for pseudo-label generation, and share your command?
    steps/decode_fmllr.sh, or online2-wav-gmm-latgen-faster directly?

Best Regards

Sorry for the late response.

  1. Yes, as you expected. We trained two phoneme tokenizers in our paper: a GMM-HMM model trained on 100 hours of data for the Base setting, and a DNN-HMM model trained on 960 hours of data for the Large setting. The GMM-HMM model is exactly 'tri4b' (available after stage 13). The DNN-HMM model is exactly the chain model obtained after running the whole script (after the last stage). See the first sketch after this list.

  2. steps/decode_fmllr.sh for the GMM-HMM model (see the second sketch below).
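
To make the first answer concrete, here is a minimal sketch of where the two models come from in the Kaldi LibriSpeech recipe. It assumes the staged layout of the linked run.sh (`if [ $stage -le N ]` blocks with a `--stage` option); stage numbers and output paths may shift across Kaldi versions, so treat this as orientation rather than the authors' exact procedure.

```bash
# Sketch: obtaining the two tokenizer acoustic models from the
# Kaldi LibriSpeech recipe (egs/librispeech/s5/run.sh).
cd kaldi/egs/librispeech/s5

# Base setting: run stages 1-13; this leaves the speaker-adapted
# (SAT) GMM-HMM model in exp/tri4b, trained on train_clean_100.
./run.sh            # interrupt (or edit in an `exit`) after stage 13

# Large setting: continue the recipe to the end; the last stage
# trains the chain (DNN-HMM) model on the full 960 hours. Its exact
# output path depends on the recipe version (e.g. under exp/chain*).
./run.sh --stage 14
```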
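
And for the second answer, a sketch of pseudo-label generation with the tri4b tokenizer. The data directory, job count, and graph/decode directory names are illustrative assumptions, not the authors' exact command; the lattice-to-phoneme step at the end is one standard Kaldi way to turn decoded lattices into frame-level phoneme labels.

```bash
. ./cmd.sh; . ./path.sh
nj=40   # illustrative job count

# Build a decoding graph for tri4b (tgsmall LM, as in the recipe).
utils/mkgraph.sh data/lang_test_tgsmall exp/tri4b exp/tri4b/graph_tgsmall

# fMLLR decoding of the unpaired speech (the data dir is an assumption
# and must already have MFCC features and CMVN stats computed);
# lattices are written under the decode directory.
steps/decode_fmllr.sh --nj $nj --cmd "$decode_cmd" \
  exp/tri4b/graph_tgsmall data/train_960 exp/tri4b/decode_train_960

# Turn the best path of each lattice into per-frame phoneme labels.
for j in $(seq $nj); do
  lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c exp/tri4b/decode_train_960/lat.$j.gz |" \
    ark:/dev/null ark:- | \
  ali-to-phones --per-frame=true exp/tri4b/final.mdl ark:- \
    ark,t:exp/tri4b/decode_train_960/phones.$j.txt
done
```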