Model Release for "Generative Spoken Dialogue Language Modeling"?
Hello!
We are interested in using the HuBERT model trained / fine-tuned on the Fisher corpus, as well as the HiFi-GAN vocoder that generates audio directly from the units, for academic research. Could these models be released soon? Thank you very much!
Hi @siyan-sylvia-li,
As for the vocoder trained on discrete units, we do not plan to release this model soon, so please see this repo and train one yourself: https://github.com/facebookresearch/speech-resynthesis
Regarding HuBERT, I recommend using the fairseq implementation here: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm
Thank you so much! I noticed that the speech-resynthesis repo has no support for wav2vec 2.0, while the gslm unit2speech module does support wav2vec 2.0. Are the speech-resynthesis code and the gslm unit2speech code fundamentally different? Thanks again!
@siyan-sylvia-li,
Yes, they are quite different: the unit2speech model in GSLM is based on Tacotron 2, while speech-resynthesis is based on HiFi-GAN.
In case you want to use wav2vec 2.0, you can extract discrete codes from wav2vec 2.0 and use them to train a unit2speech model from the speech-resynthesis repo, roughly as in the sketch below.
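If it helps, here is a minimal sketch of that extraction step, assuming torchaudio's pretrained `WAV2VEC2_BASE` bundle and a scikit-learn k-means codebook. The layer index, cluster count, and file list are placeholders; the fairseq speech2unit scripts linked above do the same job in a more complete way.

```python
import torch
import torchaudio
import joblib
from sklearn.cluster import MiniBatchKMeans

# Pretrained wav2vec 2.0 from torchaudio, as a stand-in for whichever
# checkpoint you actually want to quantize with.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

LAYER = 6  # hypothetical choice of transformer layer to take features from

@torch.inference_mode()
def extract_features(wav_path: str) -> torch.Tensor:
    """Return frame-level wav2vec 2.0 features for one (mono) utterance."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    # extract_features returns one tensor per transformer layer
    features, _ = model.extract_features(waveform)
    return features[LAYER].squeeze(0)  # (frames, dim)

# 1) Fit a k-means codebook on features pooled from your training audio.
train_files = ["utt1.wav", "utt2.wav"]  # placeholder file list
feats = torch.cat([extract_features(f) for f in train_files]).numpy()
kmeans = MiniBatchKMeans(n_clusters=100, n_init=10).fit(feats)
joblib.dump(kmeans, "w2v2_km100.bin")

# 2) Quantize each utterance into a discrete unit sequence for the vocoder.
for f in train_files:
    units = kmeans.predict(extract_features(f).numpy())
    print(f, " ".join(map(str, units)))
```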
Hello, I have two questions:
- We are thinking about training a unit2speech HiFi-GAN, but could not find any training details in the speech-resynthesis paper. Roughly how many GPUs / hours did it take to train a HiFi-GAN?
- Can you provide more detailed instructions on how to quantize with wav2vec 2.0 and then train? I see configuration files in the speech-resynthesis repo that use other quantization models, including HuBERT and CPC, but it is still not clear to me how to adapt the existing config files to my own quantization models so I can encode different datasets.
Thank you very much for your time!
Hi @siyan-sylvia-li,
- We train our model on 8 GPUs for 400K iterations. You can see the details in the code: https://github.com/facebookresearch/speech-resynthesis. Training on fewer GPUs should also work, but it will probably be slower to converge.
- You need to replace the tokens extracted from HuBERT/CPC with tokens extracted from wav2vec 2.0. First extract units for the VCTK corpus using the speech2unit pipeline here: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit. Then train your vocoder on these units using this repo: https://github.com/facebookresearch/speech-resynthesis (see the sketch after this list for the manifest side of that step).
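To make the second step a bit more concrete, here is a rough sketch (continuing the Python example further up, and reusing its `extract_features` and the fitted k-means model) that quantizes a folder of VCTK wavs and writes a JSON-lines manifest pairing each file with its unit string. The key names ("audio", "hubert", "duration") and the manifest layout are assumptions modeled on the HuBERT configs in the speech-resynthesis repo; check its dataset code and the existing dataset files for the exact format it expects.

```python
import json
from pathlib import Path

import joblib
import torchaudio

# Reuse extract_features() and the k-means codebook from the sketch above.
kmeans = joblib.load("w2v2_km100.bin")

VCTK_DIR = Path("VCTK-Corpus/wav48")          # placeholder path to the VCTK audio
OUT_MANIFEST = Path("vctk_w2v2_units_train.txt")

with OUT_MANIFEST.open("w") as out:
    for wav_path in sorted(VCTK_DIR.rglob("*.wav")):
        units = kmeans.predict(extract_features(str(wav_path)).numpy())
        info = torchaudio.info(str(wav_path))
        entry = {
            # Key names are assumptions; adjust them to whatever the
            # speech-resynthesis dataloader actually reads.
            "audio": str(wav_path),
            "hubert": " ".join(map(str, units)),
            "duration": info.num_frames / info.sample_rate,
        }
        out.write(json.dumps(entry) + "\n")
```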