VQMIVC


VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech 2021)

Purpose of the project: to soften the converted audio, remove noise and distortion, and improve the sound quality for the listener.

Project overview: quantized vectors are used to modify the audio. Each peak of the sound wave is represented as a vector, the necessary changes are applied to that vector, and at the end the new vectors replace the original ones, so the desired changes appear in the output audio.

System operation: the sound wave is converted into quantized vectors; the extrema are routed to a separate branch while the vectors go to the encoder. The required changes are then applied to the vectors, and the new vectors are passed to the decoder.
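
To make the quantization idea concrete, here is a rough, hedged sketch (not the repository's actual VQCPC module; the codebook size and dimensions below are made up): vector quantization replaces each frame-level vector with its nearest codebook entry.

    # Hedged sketch of vector quantization: each input vector is replaced
    # by its nearest entry in a codebook. Sizes here are illustrative only.
    import torch

    def quantize(frames, codebook):
        # frames: (num_frames, dim), codebook: (num_codes, dim)
        distances = torch.cdist(frames, codebook)   # pairwise L2 distances
        indices = distances.argmin(dim=1)           # nearest code per frame
        return codebook[indices], indices

    # Toy usage with made-up sizes.
    frames = torch.randn(100, 64)      # e.g. per-frame encoder outputs
    codebook = torch.randn(512, 64)    # learned codebook (random here)
    quantized, codes = quantize(frames, codebook)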

In the original code, the maximum length (magnitude) of the quantized vectors was 8 units; after trial and error I reduced it to 7.15 units to improve system performance.

I also reduced the distortion slightly and adjusted the spacing between the extrema, in proportion to each group of vectors, as they pass through the filter. This produces a Gaussian-like distribution over the audio signal, which yields more bass and better sound quality.
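
Purely as a hypothetical illustration of the kind of change described above (the function and variable names are mine, not the repository's), capping the magnitude of the quantized vectors could look like this:

    # Hypothetical sketch only: limit each quantized vector's L2 norm to a
    # maximum value (8 in the original description, reduced here to 7.15).
    import torch

    MAX_NORM = 7.15

    def clamp_norm(vectors, max_norm=MAX_NORM):
        # vectors: (N, dim); rescale any vector whose norm exceeds max_norm.
        norms = vectors.norm(dim=1, keepdim=True)
        scale = torch.clamp(max_norm / (norms + 1e-8), max=1.0)
        return vectors * scale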

I am Hossein Nazari, a Bachelor of Electrical Engineering student at the University of Science and Research, Tehran, Iran. Sorry if I have disturbed the order of the code; I hope it is better now. I have noted my changes and impressions of the source code at the bottom of each file under the heading "Hossein Nazari notes".


Integrated into Hugging Face Spaces with Gradio. See the Gradio Web Demo.

Pre-trained models: google-drive or here | Paper demo

This paper proposes a speech representation disentanglement framework for one-shot/any-to-any voice conversion, which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding and mutual information (MI) is introduced as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner.
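
For intuition about the MI term, below is a simplified, hedged sketch of a CLUB-style mutual-information upper bound (the repository's estimator is modified from CLUB; the network sizes and names here are illustrative assumptions, not the paper's exact configuration). Minimizing such a bound with respect to the encoders, while q(y|x) is fit on matched pairs, reduces the dependency between two representations such as content and speaker.

    # Hedged sketch of a CLUB-style MI upper bound between two representations
    # x and y (e.g. content and speaker embeddings). Illustrative only.
    import torch
    import torch.nn as nn

    class CLUBSketch(nn.Module):
        def __init__(self, x_dim, y_dim, hidden=256):
            super().__init__()
            # Gaussian variational approximation q(y|x).
            self.mu = nn.Sequential(
                nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
            self.logvar = nn.Sequential(
                nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

        def forward(self, x, y):
            # Upper bound: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)].
            mu, logvar = self.mu(x), self.logvar(x)
            positive = -((y - mu) ** 2) / logvar.exp()   # matched pairs
            negative = -((y.unsqueeze(0) - mu.unsqueeze(1)) ** 2) / logvar.exp().unsqueeze(1)
            return (positive.mean() - negative.mean()) / 2.0   # minimize w.r.t. encoders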

📢 Update

Many thanks to ericguizzo & AK391!

  1. A Replicate demo is provided online, so you can play with our pre-trained models there. Have fun!
  2. VQMIVC can be trained and tested inside a Docker environment via Cog now.
  3. Gradio Web Demo is available, another online demo!

TODO

  • Add more details on how to use Cog for development

Requirements

Python 3.6 is used. Install apex to speed up training (optional); the other requirements are listed in 'requirements.txt':

pip install -r requirements.txt

Quick start with pre-trained models

ParallelWaveGAN is used as the vocoder, so first install ParallelWaveGAN before trying the pre-trained models.
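
If ParallelWaveGAN is not already installed, one common route is pip (assuming the parallel_wavegan package published from the ParallelWaveGAN repository; installing from source also works):

    pip install parallel_wavegan

Then run the conversion script: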

python convert_example.py -s {source-wav} -r {reference-wav} -c {converted-wavs-save-path} -m {model-path} 

For example:

python convert_example.py -s test_wavs/p225_038.wav -r test_wavs/p334_047.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt 

The converted wav files are written to the 'converted' directory.
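
To convert many utterance pairs in one go, a small wrapper like the following can help; it is a minimal sketch that only reuses the documented convert_example.py flags, and the file list below is a placeholder:

    # Hedged sketch: loop over (source, reference) pairs and call the
    # repository's convert_example.py with its documented flags.
    import subprocess

    pairs = [
        ("test_wavs/p225_038.wav", "test_wavs/p334_047.wav"),  # placeholder pair
    ]
    model = "checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt"

    for src, ref in pairs:
        subprocess.run(
            ["python", "convert_example.py",
             "-s", src, "-r", ref, "-c", "converted", "-m", model],
            check=True,
        )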

Training and inference:

  • Step 1. Data preparation & preprocessing
  1. Put VCTK corpus under directory: 'Dataset/'

  2. Training/testing speaker split & feature (mel + lf0) extraction (see the feature-extraction sketch after this list):

     python preprocess.py
    
  • Step 2. Model training:
  1. Training with mutual information minimization (MIM):

     python train.py use_CSMI=True use_CPMI=True use_PSMI=True
    
  2. Training without MIM:

     python train.py use_CSMI=False use_CPMI=False use_PSMI=False 
    
  • Step 3. Model testing:
  1. Put PWG vocoder under directory: 'vocoder/'

  2. Inference with model trained with MIM:

     python convert.py checkpoint=checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt
    
  3. Inference with model trained without MIM:

     python convert.py checkpoint=checkpoints/useCSMIFalse_useCPMIFalse_usePSMIFalse_useAmpTrue/model.ckpt-500.pt
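
As referenced in Step 1, here is a minimal, hedged sketch of what mel + lf0 extraction can look like. This is not the repository's preprocess.py; the sampling rate, FFT size, hop length and mel-band count below are illustrative assumptions.

    # Hedged sketch of mel-spectrogram + log-F0 (lf0) extraction.
    import numpy as np
    import librosa
    import pyworld

    def extract_mel_lf0(wav_path, sr=16000, n_fft=1024, hop_length=200, n_mels=80):
        wav, _ = librosa.load(wav_path, sr=sr)

        # Log mel-spectrogram via librosa.
        mel = librosa.feature.melspectrogram(
            y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
        log_mel = np.log(np.maximum(mel, 1e-10)).T        # (frames, n_mels)

        # F0 via PyWorld (dio + stonemask); log-F0 with unvoiced frames left at 0.
        wav64 = wav.astype(np.float64)
        f0, t = pyworld.dio(wav64, sr, frame_period=hop_length / sr * 1000)
        f0 = pyworld.stonemask(wav64, f0, t, sr)
        lf0 = np.zeros_like(f0)
        lf0[f0 > 0] = np.log(f0[f0 > 0])
        return log_mel, lf0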
    

Citation

If the code is used in your research, please Star our repo and cite our paper:

@inproceedings{wang21n_interspeech,
  author={Disong Wang and Liqun Deng and Yu Ting Yeung and Xiao Chen and Xunying Liu and Helen Meng},
  title={{VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1344--1348},
  doi={10.21437/Interspeech.2021-283}
}

Acknowledgements:

  • The content encoder is borrowed from VectorQuantizedCPC, which also inspired the within-utterance negative sampling for CPC;
  • The speaker encoder is borrowed from AdaIN-VC;
  • The decoder is modified from AutoVC;
  • Estimation of mutual information is modified from CLUB;
  • Speech features extraction is based on espnet and Pyworld.