I have released a new PyTorch implementation of uSFGAN, along with an improved model, Harmonic-plus-Noise uSFGAN, here.
This is the official PyTorch implementation of uSFGAN, a unified source-filter network based on the factorization of QPPWG by Yi-Chiao Wu @ Nagoya University (@bigpon).
In this repo, we provide an example of training and testing uSFGAN as a vocoder for WORLD acoustic features. More details can be found on our Demo page.
This repository is tested on Ubuntu 20.04 with a Titan RTX 3090 GPU.
- Python 3.8+
- CUDA 11.0
- cuDNN 7+
- PyTorch 1.7.1+
$ cd UnifiedSourceFilterGAN
$ pip install -e .
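You can then verify that the package is importable (a minimal sanity check, assuming the editable install above succeeded):
$ python -c "import usfgan; print(usfgan.__file__)"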
Please refer to the PWG repo for more details.
- egs: The folder for projects.
- egs/vcc18: The folder of the VCC2018 project.
- egs/vcc18/exp: The folder for trained models.
- egs/vcc18/conf: The folder for configs.
- egs/vcc18/data: The folder for corpus related files (wav, feature, list ...).
- usfgan: The folder of the source code.
Projects on the CMU-ARCTIC corpus are also available:
- Check egs/arctic/*
- Dataset separation is based on the official NSF implementation.
- Modify the corresponding CUDA paths in egs/vcc18/run.py.
- Download the Voice Conversion Challenge 2018 (VCC2018) corpus to run the uSFGAN example.
$ cd egs/vcc18
# Download training and validation corpus
$ wget -o train.log -O data/wav/train.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_training.zip
# Download evaluation corpus
$ wget -o eval.log -O data/wav/eval.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_evaluation.zip
# Unzip the corpus
$ unzip data/wav/train.zip -d data/wav/
$ unzip data/wav/eval.zip -d data/wav/
- Training wav list: data/scp/vcc18_train_22kHz.scp
- Validation wav list: data/scp/vcc18_valid_22kHz.scp
- Testing wav list: data/scp/vcc18_eval_22kHz.scp
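Each .scp file is a plain-text list of wav paths, one path per line. A hypothetical excerpt (speaker and file names are illustrative; the actual paths depend on where the corpus was unzipped):
data/wav/train/VCC2SF1/10001.wav
data/wav/train/VCC2SF1/10002.wav
data/wav/train/VCC2SM1/10001.wav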
# Extract WORLD acoustic features and statistics of training and testing data
$ bash run.sh --stage 0 --conf uSFGAN_60
- WORLD-related settings can be changed in egs/vcc18/conf/vcc18.uSFGAN_60.yaml.
- If you want to use another corpus, please create a corresponding config and a file including power thresholds and f0 ranges like egs/vcc18/data/pow_f0_dict.yml (a hypothetical entry is sketched below).
- More details about feature extraction can be found in the QPNet repo.
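For a new corpus, such a file maps each speaker to an f0 search range and a power threshold used during feature extraction. A hypothetical entry (the key names and values here are illustrative, not the actual schema; check the provided egs/vcc18/data/pow_f0_dict.yml for the real format):
VCC2SF1:        # speaker ID (illustrative)
  f0_min: 130   # f0 search floor in Hz (illustrative)
  f0_max: 400   # f0 search ceiling in Hz (illustrative)
  pow_th: -20   # power threshold in dB (illustrative)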
- The lists of auxiliary features will be automatically generated.
- Training aux list: data/scp/vcc18_train_22kHz.list
- Validation aux list: data/scp/vcc18_valid_22kHz.list
- Testing aux list: data/scp/vcc18_eval_22kHz.list
# Training a uSFGAN model with the 'uSFGAN_60' config and the 'vcc18_train_22kHz' and 'vcc18_valid_22kHz' sets.
$ bash run.sh --gpu 0 --stage 1 --conf uSFGAN_60 \
--trainset vcc18_train_22kHz --validset vcc18_valid_22kHz
- The GPU ID can be set by --gpu GPU_ID (default: 0).
- The model architecture can be set by --conf CONFIG (default: uSFGAN_60).
- Training can be resumed from a trained checkpoint by --resume NUM (default: None), as in the example below.
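For example, training can be resumed from an existing checkpoint as follows (the iteration number 100000 is illustrative):
# Resume uSFGAN training from the checkpoint at 100000 iterations
$ bash run.sh --gpu 0 --stage 1 --conf uSFGAN_60 --resume 100000 \
--trainset vcc18_train_22kHz --validset vcc18_valid_22kHz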
# uSFGAN/QPPWG/PWG decoding w/ natural acoustic features
$ bash run.sh --gpu 0 --stage 2 --conf uSFGAN_60 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
# uSFGAN/QPPWG/PWG decoding w/ scaled f0 (e.g., halved f0)
$ bash run.sh --gpu 0 --stage 3 --conf uSFGAN_60 --scaled 0.50 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
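The scaling factor is not limited to halving; for instance, decoding with doubled f0 uses the same stage with a different --scaled value (the factor 2.00 is illustrative):
# uSFGAN/QPPWG/PWG decoding w/ doubled f0
$ bash run.sh --gpu 0 --stage 3 --conf uSFGAN_60 --scaled 2.00 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz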
# Monitor training progress with TensorBoard
$ tensorboard --logdir exp
- The training time of uSFGAN_60 with a Titan RTX 3090 GPU is around 6 days.
If you find the code helpful, please cite the following article.
@inproceedings{yoneyama21_interspeech,
author={Reo Yoneyama and Yi-Chiao Wu and Tomoki Toda},
title={{Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={2187--2191},
doi={10.21437/Interspeech.2021-517}
}
Development:
Reo Yoneyama @ Nagoya University (@chomeyama)
E-mail: yoneyama.reo@g.sp.m.is.nagoya-u.ac.jp
Advisor:
Yi-Chiao Wu @ Nagoya University (@bigpon)
E-mail: yichiao.wu@g.sp.m.is.nagoya-u.ac.jp
Tomoki Toda @ Nagoya University
E-mail: tomoki@icts.nagoya-u.ac.jp