Model Release for "Generative Spoken Dialogue Language Modeling"?
Hello!
We are interested in using the HuBERT model trained / fine-tuned on the Fisher corpus, as well as the HiFi-GAN vocoder that generates audio directly from the units, for academic research. Could these models be released soon? Thank you very much!
Hi @siyan-sylvia-li,
As for the vocoder trained on discrete units, we do not plan to release this model soon, so please see this repo and train one yourself: https://github.com/facebookresearch/speech-resynthesis
Regarding HuBERT, I recommend using the fairseq implementation here: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm
Thank you so much! I noticed that the speech-resynthesis repo has no support for wav2vec 2.0, while the gslm unit2speech module does support wav2vec 2.0. Are the speech-resynthesis code and the gslm unit2speech code fundamentally different? Thanks again!
@siyan-sylvia-li,
Yes, they are quite different: the unit2speech model in GSLM is based on Tacotron 2, while speech-resynthesis is based on HiFi-GAN.
In case you want to use wav2vec 2.0, you can extract discrete codes from wav2vec 2.0 and use them to train a unit2speech model from the speech-resynthesis repo, roughly as in the sketch below.
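If it helps, here is a minimal sketch of that extraction step, assuming torchaudio's pretrained `WAV2VEC2_BASE` bundle and a scikit-learn k-means codebook. The layer index, cluster count, and file list are placeholders; the fairseq speech2unit scripts linked above do the same job in a more complete way.

```python
import torch
import torchaudio
import joblib
from sklearn.cluster import MiniBatchKMeans

# Pretrained wav2vec 2.0 from torchaudio, as a stand-in for whichever
# checkpoint you actually want to quantize with.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

LAYER = 6  # hypothetical choice of transformer layer to take features from

@torch.inference_mode()
def extract_features(wav_path: str) -> torch.Tensor:
    """Return frame-level wav2vec 2.0 features for one (mono) utterance."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    # extract_features returns one tensor per transformer layer
    features, _ = model.extract_features(waveform)
    return features[LAYER].squeeze(0)  # (frames, dim)

# 1) Fit a k-means codebook on features pooled from your training audio.
train_files = ["utt1.wav", "utt2.wav"]  # placeholder file list
feats = torch.cat([extract_features(f) for f in train_files]).numpy()
kmeans = MiniBatchKMeans(n_clusters=100, n_init=10).fit(feats)
joblib.dump(kmeans, "w2v2_km100.bin")

# 2) Quantize each utterance into a discrete unit sequence for the vocoder.
for f in train_files:
    units = kmeans.predict(extract_features(f).numpy())
    print(f, " ".join(map(str, units)))
```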
Hello, I have two questions:
- We are thinking about training a unit2speech HiFi-GAN, but could not find any training details in the speech-resynthesis paper. Roughly how many GPUs / hours did it take to train a HiFi-GAN?
- Can you provide more detailed instructions on how to quantize with wav2vec 2.0 and then train? I see configuration files in the speech-resynthesis repo that use other quantization models, including HuBERT and CPC, but it is still not clear to me how to adapt the existing config files to my own quantization models so I can encode different datasets.
Thank you very much for your time!
Hi @siyan-sylvia-li,
- We train our model on 8 GPUs for 400K iterations. You can see the details in the code: https://github.com/facebookresearch/speech-resynthesis. Training on fewer GPUs should also work, but it will probably be slower to converge.
- You need to replace the tokens extracted from HuBERT/CPC with tokens extracted from wav2vec 2.0. First extract units for the VCTK corpus using the speech2unit pipeline here: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit. Then train your vocoder on these units using this repo: https://github.com/facebookresearch/speech-resynthesis (see the sketch after this list for the manifest side of that step).
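To make the second step a bit more concrete, here is a rough sketch (continuing the Python example further up, and reusing its `extract_features` and the fitted k-means model) that quantizes a folder of VCTK wavs and writes a JSON-lines manifest pairing each file with its unit string. The key names ("audio", "hubert", "duration") and the manifest layout are assumptions modeled on the HuBERT configs in the speech-resynthesis repo; check its dataset code and the existing dataset files for the exact format it expects.

```python
import json
from pathlib import Path

import joblib
import torchaudio

# Reuse extract_features() and the k-means codebook from the sketch above.
kmeans = joblib.load("w2v2_km100.bin")

VCTK_DIR = Path("VCTK-Corpus/wav48")          # placeholder path to the VCTK audio
OUT_MANIFEST = Path("vctk_w2v2_units_train.txt")

with OUT_MANIFEST.open("w") as out:
    for wav_path in sorted(VCTK_DIR.rglob("*.wav")):
        units = kmeans.predict(extract_features(str(wav_path)).numpy())
        info = torchaudio.info(str(wav_path))
        entry = {
            # Key names are assumptions; adjust them to whatever the
            # speech-resynthesis dataloader actually reads.
            "audio": str(wav_path),
            "hubert": " ".join(map(str, units)),
            "duration": info.num_frames / info.sample_rate,
        }
        out.write(json.dumps(entry) + "\n")
```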