WaveGrad

MindSpore implementation of WaveGrad, a diffusion-based vocoder model for text-to-speech systems.

Demo

Sample audio (transcript: "Be this as it may, the weapon used was only an ordinary axe, which rather indicates that force, not skill, was employed.")

[Figure: compare_lj]

Sample audio (transcript: "This is a MindSpore implementation of the WaveGrad model, a diffusion based vocoder model for text to speech systems. Many thanks to Open I for computational resources!")

[Figure: compare_fs2]

Dependencies

  1. pip install -r requirements.txt
  2. Install MindSpore.
  3. (Optional) Install OpenMPI (mpirun) for distributed training.

Generate from your data

From wav files:

python reverse.py --restore model_1000000.ckpt --wav LJ010-0142.wav --save results --device_target Ascend --device_id 0 --plot

From melspectrograms:

python reverse.py --restore model_1000000.ckpt --mel fs.npy --save results --device_target Ascend --device_id 0 --plot
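
For reference, the reverse (denoising) process that reverse.py performs boils down to the standard iterative diffusion update sketched below. This is a minimal NumPy sketch, assuming a linear 1000-step beta schedule and a hypothetical model(y, mel, noise_level) callable that predicts the injected noise; the schedule, hop size, and function names are illustrative, not this repo's actual API.

import numpy as np

def reverse_sketch(model, mel, num_steps=1000, beta_start=1e-6, beta_end=0.01, hop=300):
    # Noise schedule (assumed linear; the repo's schedule may differ).
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alphas_bar = np.cumprod(alphas)

    # Start from Gaussian noise with the target waveform length (frames * hop).
    y = np.random.randn(mel.shape[-1] * hop).astype(np.float32)

    for n in reversed(range(num_steps)):
        # Predict the noise; WaveGrad conditions on the continuous noise level sqrt(alpha_bar_n).
        eps = model(y, mel, np.sqrt(alphas_bar[n]))
        # Posterior mean of the reverse step.
        y = (y - (1.0 - alphas[n]) / np.sqrt(1.0 - alphas_bar[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:
            sigma = np.sqrt((1.0 - alphas_bar[n - 1]) / (1.0 - alphas_bar[n]) * betas[n])
            y = y + sigma * np.random.randn(*y.shape)
    return np.clip(y, -1.0, 1.0)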

Pretrained Models

| Model | Dataset | Checkpoint | Total Batch Size | Num Frames | Num Mels | Hardware | MindSpore Version |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WaveGrad (base) | LJSpeech-1.1 | 1M steps | 256 | 30 | 128 | 8 × Ascend | 1.9.0 |
| WaveGrad (base) | AiShell | TODO | 256 | 30 | 128 | 8 × Ascend | 1.9.0 |
| FastSpeech2 | LJSpeech-1.1 | TODO | 64 | - | 128 | 8 × GPU | 1.9.0 |

For the FastSpeech2 model, we skipped the audio preprocessing step and directly used this repo's preprocessed mel-spectrograms.

Train your own model

Step 0 (Data)

0.0

Download LJSpeech-1.1 to ./data/.

0.1

Preprocess the data to produce a "_wav.npy" and a "_feature.npy" file for each ".wav" file in your dataset folder. Set data_path and manifest_path in base.yaml, then run:

python preprocess.py --device_target CPU --device_id 0
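
For orientation, the per-file output amounts to a normalized waveform array plus a log-mel-spectrogram feature array. Below is a minimal librosa sketch of that idea; the sample rate, FFT/hop sizes, and normalization are assumptions (the real values live in base.yaml and preprocess.py).

import numpy as np
import librosa

def preprocess_one(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=128):
    # Load and peak-normalize the waveform (parameters are illustrative).
    wav, _ = librosa.load(wav_path, sr=sr)
    wav = wav / max(1e-8, float(np.abs(wav).max()))

    # Log-mel-spectrogram features, shape (n_mels, num_frames).
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    mel = np.log(np.clip(mel, 1e-5, None))

    stem = wav_path[:-len(".wav")]
    np.save(stem + "_wav.npy", wav.astype(np.float32))
    np.save(stem + "_feature.npy", mel.astype(np.float32))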

Step 1 (Train)

1.1 Train on local server

Set up device information:

export MY_DEVICE=Ascend # options: [Ascend, GPU]
export MY_DEVICE_NUM=8

Other training and model parameters can be set in base.yaml.

Train on multiple cards (each card gets a batch size of hparams.batch_size // MY_DEVICE_NUM):

nohup mpirun --allow-run-as-root -n $MY_DEVICE_NUM python train.py --device_target $MY_DEVICE --is_distributed True --context_mode graph > train_distributed.log &

Train on 1 card:

export MY_DEVICE_ID=0
nohup python train.py --device_target $MY_DEVICE --device_id $MY_DEVICE_ID --context_mode graph > train_single.log &
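
For context, one WaveGrad training step samples a continuous noise level, corrupts a clean waveform segment with Gaussian noise at that level, and regresses the injected noise with an L1 loss. The MindSpore sketch below illustrates that step under assumed names (DiffusionLoss, make_batch) and paper-style hyperparameters; the actual training loop in train.py may be organized differently.

import numpy as np
import mindspore as ms
from mindspore import nn, ops

class DiffusionLoss(nn.Cell):
    # Wraps the noise-prediction network with an L1 loss on the injected noise.
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.abs = ops.Abs()
        self.mean = ops.ReduceMean()

    def construct(self, noisy, mel, noise_level, noise):
        pred = self.model(noisy, mel, noise_level)
        return self.mean(self.abs(pred - noise))

def make_batch(wav_segment, alphas_bar):
    # Sample sqrt(alpha_bar) uniformly between two adjacent schedule points,
    # then corrupt the clean segment with Gaussian noise at that level.
    batch = wav_segment.shape[0]
    n = np.random.randint(1, len(alphas_bar))
    s = np.random.uniform(np.sqrt(alphas_bar[n]), np.sqrt(alphas_bar[n - 1]),
                          size=(batch, 1)).astype(np.float32)
    noise = np.random.randn(*wav_segment.shape).astype(np.float32)
    noisy = s * wav_segment + np.sqrt(1.0 - s ** 2) * noise
    return ms.Tensor(noisy), ms.Tensor(s), ms.Tensor(noise)

# One optimization step (model construction and data loading omitted):
# train_net = nn.TrainOneStepCell(DiffusionLoss(model), nn.Adam(model.trainable_params(), 1e-4))
# loss = train_net(noisy, mel_segment, noise_level, noise)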

1.2 Train on 8 Ascend cards on OpenI

A quick guide to using OpenI:

  1. git clone a repo
  2. create a train task
  3. locally preprocess the data, zip it, and upload it to your job's dataset
  4. set the task options as follows:

Start file:

train.py

Run parameters:

is_openi = True
is_distributed = True
device_target = Ascend
context_mode = graph
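
These parameters typically translate into MindSpore's context and communication setup. A hedged sketch of that wiring is shown below; the actual logic in train.py may differ, and the OpenI-specific data/checkpoint syncing triggered by is_openi is omitted.

from mindspore import context
from mindspore.communication import init, get_group_size

def setup_context(device_target="Ascend", context_mode="graph", is_distributed=True):
    # Map context_mode to MindSpore's execution mode.
    mode = context.GRAPH_MODE if context_mode == "graph" else context.PYNATIVE_MODE
    context.set_context(mode=mode, device_target=device_target)

    if is_distributed:
        init()  # initialize HCCL (Ascend) / NCCL (GPU) collective communication
        context.set_auto_parallel_context(
            parallel_mode=context.ParallelMode.DATA_PARALLEL,
            gradients_mean=True,
            device_num=get_group_size(),
        )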

Implementation details

The interpolation operators in both the downsample and upsample blocks are replaced by a simple repeat operator, whose output is divided by the repeat factor.

Some additions in the UBlock are divided by a constant $\sqrt{2}$ to avoid potential numerical overflow.
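
A minimal sketch of these two tweaks, assuming (batch, channels, time) tensors; the layout and factor values are illustrative rather than taken from this repo's modules.

import math
from mindspore import ops

def repeat_upsample(x, factor):
    # Repeat every time step `factor` times along the time axis and divide by
    # the factor, standing in for the original interpolation operator.
    return ops.repeat_elements(x, rep=factor, axis=2) / factor

def scaled_add(a, b):
    # Skip-connection addition in UBlock, scaled by 1/sqrt(2) to keep
    # activation magnitudes from growing.
    return (a + b) / math.sqrt(2.0)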

Acknowledgements

Some materials helpful for understanding diffusion models:

Some repositories that inspired this implementation:

Computational Resources:

License

GNU General Public License v2.0