Code for the paper "Decoupling segmental and prosodic cues of non-native speech through vector quantization"
by Waris Quamer, Anurag Das, and Ricardo Gutierrez-Osuna.
See details and audio samples here.
- Install ffmpeg.
- Install Kaldi.
- Install PyKaldi.
- Install the required packages using the environment.yml file.
- Download the pretrained TDNN-F model, extract it, and set `PRETRAIN_ROOT` in `kaldi_scripts/extract_features_kaldi.sh` to the pretrained model directory.
- You also need to set `KALDI_ROOT` and `PRETRAIN_ROOT` in `kaldi_scripts/extract_features_kaldi.sh` accordingly (a minimal setup sketch follows the pretrained-model list below).

Pretrained models:
- Acoustic Model: LibriSpeech. Download the pretrained TDNN-F acoustic model here.
- Speaker Encoder: LibriSpeech; see here for the detailed training process.
- Vector Quantization: ARCTIC and L2-ARCTIC; see here for the detailed training process.
- Synthesizer (i.e., Seq2seq model): ARCTIC and L2-ARCTIC. Please see here for a merged version.
- Vocoder (HiFiGAN): LibriSpeech (training code to be updated).
All the pretrained models are available here (to be updated).
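A minimal sketch of the environment setup described above, assuming a conda-compatible environment.yml; the install paths `/opt/kaldi` and `/models/tdnnf` are placeholders, not repository defaults:

```bash
# Create the Python environment from the repository's environment.yml
conda env create -f environment.yml

# Then edit these variables in kaldi_scripts/extract_features_kaldi.sh:
#   KALDI_ROOT=/opt/kaldi        # placeholder: your Kaldi installation
#   PRETRAIN_ROOT=/models/tdnnf  # placeholder: extracted TDNN-F model directory
```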
Dataset layout:

    dataset_root
    ├── speaker 1
    ├── speaker 2
    │   ├── wav    # contains all the wav files from speaker 2
    │   └── kaldi  # Kaldi files (auto-generated after running kaldi_scripts)
    .
    .
    └── speaker N
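A hypothetical example of placing one speaker's recordings into this layout (the source path is a placeholder):

```bash
# Create the expected per-speaker directory and copy in the recordings;
# the kaldi/ subdirectory is generated later by the extraction script.
mkdir -p "dataset_root/speaker 1/wav"
cp /path/to/recordings/*.wav "dataset_root/speaker 1/wav/"
```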
- Use Kaldi to extract BNFs for each speaker (repeat for every speaker; see the loop sketch below):

      ./kaldi_scripts/extract_features_kaldi.sh /path/to/speaker
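A minimal sketch for running the extraction over all speakers at once, assuming the dataset layout above (the dataset path is a placeholder):

```bash
#!/usr/bin/env bash
set -e

DATASET_ROOT=/path/to/dataset  # placeholder: your dataset root

# Run Kaldi BNF extraction once per speaker directory.
for speaker_dir in "$DATASET_ROOT"/*/; do
    ./kaldi_scripts/extract_features_kaldi.sh "${speaker_dir%/}"  # strip trailing slash
done
```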
- Preprocessing:

      python preprocess_bnfs.py path/to/dataset
      python generate_speaker_embeds.py path/to/dataset
      python make_data_all.py  # edit the file to specify the dataset path
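The three preprocessing commands can be wrapped in one script; a sketch assuming the same placeholder dataset path (note that make_data_all.py reads its dataset path from inside the file, per the comment above):

```bash
#!/usr/bin/env bash
set -e

DATASET_ROOT=/path/to/dataset  # placeholder: your dataset root

python preprocess_bnfs.py "$DATASET_ROOT"
python generate_speaker_embeds.py "$DATASET_ROOT"

# Edit make_data_all.py to point at $DATASET_ROOT before this step.
python make_data_all.py
```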
- Vector quantize the BNFs; see here.
- Set the training parameters; see conf/.
- Train Model 1:

      ./train_vc128_all.sh

- Train Model 2:

      ./train_vc128_all_prosody_ecapa.sh