- Official implementation of the paper *A Unified One-Shot Prosody and Speaker Conversion System with Self-Supervised Discrete Speech Units*.
- Submitted to ICASSP 2023.
- Audio samples and a demo for our system can be accessed here.
See `setup.sh` for details on package installation (especially if you have problems installing textless-lib). We used PyTorch 1.12.1 and PyTorch Lightning 1.7.7; compatibility with other versions is not guaranteed.
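As a quick sanity check that `setup.sh` installed the tested versions, you can run a snippet like the following (a minimal sketch; adjust it if you deliberately use other versions):

```python
# Sanity-check the versions this repo was tested with
# (PyTorch 1.12.1, PyTorch Lightning 1.7.7).
import torch
import pytorch_lightning as pl

print(torch.__version__, pl.__version__)
assert torch.__version__.startswith("1.12"), "tested with PyTorch 1.12.1"
assert pl.__version__.startswith("1.7"), "tested with PyTorch Lightning 1.7.7"
```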
We use a pretrained HiFi-GAN vocoder. Download the Universal-V1 checkpoint from the HiFi-GAN repo and put the files under `vocoder/cp_hifigan/`.
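For reference, loading the vocoder in Python looks roughly like the sketch below. It assumes the layout of the official HiFi-GAN repo (`models.py`, `env.py`) and the Universal-V1 filenames (`g_02500000`, `config.json`); the actual loading code in this repo may differ.

```python
# Sketch: load the Universal-V1 HiFi-GAN generator for inference.
# Assumes the jik876/hifi-gan modules are importable and that the
# checkpoint files sit in vocoder/cp_hifigan/ (filenames are assumptions).
import json
import torch
from env import AttrDict      # from the HiFi-GAN repo
from models import Generator  # from the HiFi-GAN repo

with open("vocoder/cp_hifigan/config.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load("vocoder/cp_hifigan/g_02500000", map_location="cpu")
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()  # standard HiFi-GAN inference step

# A (1, num_mels, frames) 22 kHz mel-spectrogram then converts to a waveform:
# audio = generator(mel).squeeze()
```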
Here we show an example of preprocessing VCTK, which you can adapt to your own dataset.
- Download the VCTK dataset from here.
- Run the commands below to preprocess the speech units, pitch, 22 kHz mel-spectrograms, energy, and 16 kHz-resampled speech (a sketch of the underlying unit extraction appears after this list):

  ```
  mkdir -p features/VCTK
  python s2u.py --VCTK --datadir VCTK_DIR --outdir features/VCTK --with_pitch_unit
  ```

  This step is time-consuming due to the iterative pitch extraction in textless-lib.
- Run the commands below to generate the training and validation splits; you can adapt the script to your own dataset (a hypothetical split sketch follows this list):

  ```
  mkdir datasets
  python make_data_vctk.py
  ```

  This outputs two files, `datasets/train_vctk.txt` and `datasets/valid_vctk.txt`, which contain the file lists for training and validation.
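For reference, the per-utterance unit extraction that `s2u.py` performs looks roughly like the example below, adapted from the textless-lib README; the model names and vocabulary size are assumptions, and `s2u.py` additionally extracts pitch and energy over the whole dataset.

```python
# Sketch of discrete-unit extraction with textless-lib (model names assumed).
# Assumes a GPU is available, as in the textless-lib README example.
import torchaudio
from textless.data.speech_encoder import SpeechEncoder

encoder = SpeechEncoder.by_name(
    dense_model_name="hubert-base-ls960",  # assumed dense model
    quantizer_model_name="kmeans",
    vocab_size=100,                        # assumed codebook size
    deduplicate=True,
).cuda()

waveform, sample_rate = torchaudio.load("p225_001_mic2.wav")  # hypothetical file
encoded = encoder(waveform.cuda())
units = encoded["units"]  # 1-D tensor of discrete speech units
```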
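`make_data_vctk.py` is specific to VCTK; if you are adapting it to another dataset, a minimal split script could look like the following hypothetical sketch (the repo's actual speaker and file selection may differ):

```python
# Hypothetical train/valid split over the preprocessed features;
# the selection logic in make_data_vctk.py may differ.
import os
import random

files = sorted(os.listdir("features/VCTK"))
random.seed(0)
random.shuffle(files)

n_valid = max(1, int(0.05 * len(files)))  # assumed 5% validation split
os.makedirs("datasets", exist_ok=True)
with open("datasets/valid_vctk.txt", "w") as f:
    f.write("\n".join(files[:n_valid]) + "\n")
with open("datasets/train_vctk.txt", "w") as f:
    f.write("\n".join(files[n_valid:]) + "\n")
```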
We provide the command below as an example; change the arguments according to your dataset:
```
mkdir ckpt
python train.py --saving_path ckpt/ \
    --training_step 70000 \
    --batch_size 200 \
    --check_val_every_n_epoch 5 \
    --traintxt datasets/train_vctk.txt \
    --validtxt datasets/valid_vctk.txt \
    [--distributed]
```
- `--distributed`: add this flag if you are training with multiple GPUs
- `--check_val_every_n_epoch`: run validation every n epochs
- `--training_step`: total number of training steps (generator + discriminator)
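For orientation, these flags presumably map onto a PyTorch Lightning 1.7 `Trainer` roughly as sketched below; the actual wiring lives in `train.py` and may differ.

```python
# Hypothetical mapping of the CLI flags to a PyTorch Lightning 1.7 Trainer.
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_steps=70000,            # --training_step
    check_val_every_n_epoch=5,  # --check_val_every_n_epoch
    default_root_dir="ckpt/",   # --saving_path
    accelerator="gpu",
    devices=-1,                 # use all visible GPUs
    strategy="ddp",             # what --distributed presumably enables
)
```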
TensorBoard logs are written to `logs/RV`, or to `LOG_DIR/RV` if you specify `--logdir LOG_DIR`.
We provide examples of synthesizing with the system in `inference.py`; you can adapt this script to your own usage. Example invocation:
```
python inference.py --result_dir ./samples --ckpt CKPT_PATH --config CONFIG_PATH --metapath META_PATH
```
- `--ckpt`: the .ckpt file generated during training, or one of the pretrained checkpoints
- `--config`: the .json file generated at the start of training, or from the pretrained checkpoints
- `--result_dir`: the desired output directory for the samples; subdirectories are created for the different conversions
- `--metapath`: a txt file containing the source and target speech paths; see `eval.txt` for an example
The output filenames follow the pattern `{source_wav_name}--{target_wav_name}.wav`. For examples of passing the original pitch and energy instead of the reconstructed values, see `inference_exact_pitch.py`, which takes the same arguments.
We provide checkpoints pretrained separately on VCTK and on (LibriTTS-360h + VCTK + ESD). The checkpoints are fairly large because they contain the full training and optimizer states. To address ethical concerns, the discriminator is also included in each checkpoint, so synthesized speech can be distinguished from real speech.
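If you only need inference, you can shrink a checkpoint by dropping the optimizer and training state. The sketch below uses the standard PyTorch Lightning checkpoint keys (the filename is hypothetical, and the exact contents of our checkpoints may differ); note that stripping a checkpoint this way also discards the discriminator mentioned above.

```python
# Sketch: strip training/optimizer state from a Lightning checkpoint,
# keeping only the model weights and hyperparameters.
import torch

ckpt = torch.load("pretrained_vctk.ckpt", map_location="cpu")  # hypothetical name
slim = {
    "state_dict": ckpt["state_dict"],
    "hyper_parameters": ckpt.get("hyper_parameters", {}),
}
torch.save(slim, "pretrained_vctk_slim.ckpt")
```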