This repository contains Fatcord's Alternative WaveRNN (Faster training), a fast-training implementation of the WaveRNN vocoder with a small GPU memory footprint.
This repo refactors the code, adds slight modifications, and no longer requires a Jupyter notebook to run.
- supports raw audio wav modelling (via a single Beta distribution)
- relatively fast synthesis speed without much optimization yet (around 2000 samples/sec on GTX 1060 Ti, 16 GB RAM, i5 processor)
- supports Fatcord's original quantized (9-bit) wav modelling
- Single Beta distribution model on held-out testing data from LJSpeech.
- 9-bit audio on held-out testing data from LJSpeech. This model trains the fastest (this is around 130 epochs).
- 10-bit audio on held-out testing data from LJSpeech. This model sounds and trains pretty close to 9-bit. We want as high a bit depth as possible.
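To illustrate why a higher bit depth is preferable, here is a minimal sketch of quantizing a waveform sample to 9 or 10 bits and measuring the reconstruction error. It assumes plain linear quantization of samples in [-1, 1]; the repository may use a different scheme, and the function names are hypothetical.

```python
def quantize(sample, bits=9):
    """Map a float sample in [-1.0, 1.0] to an integer class in [0, 2**bits - 1]."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))
    # Shift to [0, 2], scale to [0, levels - 1], and round to the nearest class.
    return int(round((clipped + 1.0) * (levels - 1) / 2.0))


def dequantize(label, bits=9):
    """Map an integer class back to a float sample in [-1.0, 1.0]."""
    levels = 2 ** bits
    return 2.0 * label / (levels - 1) - 1.0


def max_roundtrip_error(bits, steps=10001):
    """Largest reconstruction error over an evenly spaced grid of samples."""
    grid = (-1.0 + 2.0 * i / (steps - 1) for i in range(steps))
    return max(abs(x - dequantize(quantize(x, bits), bits)) for x in grid)
```

Under this assumption, 9 bits gives 512 classes and 10 bits gives 1024, so each extra bit roughly halves the worst-case quantization error while doubling the number of output classes the model has to predict.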
- Single Beta Distribution trained for 112k. Make sure to change hparams.input_type to raw.
- 9-bit quantized audio trained for 11k, or around 130 epochs; it can be trained further. Make sure to change hparams.input_type to bits.
- 10-bit quantized audio.
To ensure your model is built properly, download the hparams.py here, then either replace your local hparams.py file with it or note and update any changes.
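For the raw input type, each audio sample is modelled with a single Beta distribution. The sketch below shows one way a sample could be drawn from such a distribution and mapped to the waveform range. The rescaling from [0, 1] to [-1, 1] and the function names are assumptions for illustration, and how the network produces alpha and beta is not covered here.

```python
import random


def sample_from_beta(alpha, beta, rng=random):
    """Draw one waveform sample in [-1.0, 1.0] from Beta(alpha, beta)."""
    x = rng.betavariate(alpha, beta)  # x lies in [0, 1]
    return 2.0 * x - 1.0  # assumed rescaling to the waveform range


def rescaled_beta_mean(alpha, beta):
    """Mean of the rescaled distribution, useful as a sanity check."""
    return 2.0 * alpha / (alpha + beta) - 1.0
```

Passing a seeded random.Random instance as rng makes the draws reproducible, for example sample_from_beta(2.0, 5.0, random.Random(0)).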
- Python 3
- CUDA >=8.0
- PyTorch >= v0.4.1
Ensure the above requirements are met, then clone the repository and install the dependencies:
git clone https://github.com/G-Wang/WaveRNN-Pytorch.git
cd WaveRNN-Pytorch
pip install -r requirements.txt
Before running the scripts, you can adjust the hyperparameters in hparams.py.
Some hyperparameters that you might want to adjust:
- fix_learning_rate: the model is robust enough to learn well with a fixed learning rate of 1e-4. I suggest you try this setting for the fastest training; you can decrease it down to 5e-6 for final step refinement. Set this to None to train with a learning rate schedule instead.
- input_type (the best performing ones are currently bits and raw, see hparams.py for more details)
- batch_size
- save_every_step (checkpoint saving frequency)
- evaluate_every_step (evaluation frequency)
- seq_len_factor (sequence length of training audio; the longer it is, the more GPU memory it takes)
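The choice between a fixed learning rate and a schedule can be sketched as follows. Only fix_learning_rate and the values 1e-4, 5e-6, and None come from the description above; the step-decay schedule and the parameters initial_lr, decay, and decay_every are hypothetical stand-ins, not the schedule this repository uses.

```python
def learning_rate(step, fix_learning_rate=1e-4,
                  initial_lr=1e-3, decay=0.5, decay_every=100_000):
    """Return the learning rate to use at a given training step."""
    if fix_learning_rate is not None:
        # A fixed rate overrides any schedule.
        return fix_learning_rate
    # Hypothetical schedule: multiply the rate by `decay` every `decay_every` steps.
    return initial_lr * decay ** (step // decay_every)
```

For example, learning_rate(step, fix_learning_rate=5e-6) would correspond to the final refinement setting, while fix_learning_rate=None falls through to the schedule.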
This script processes raw wav files into the corresponding mel-spectrogram and wav files according to the audio processing hyperparameters.
Example usage:
python preprocess.py /path/to/my/wav/files
This will process all the .wav files in the folder /path/to/my/wav/files and save them in the default local directory called data_dir.
You can include --output_dir to specify a different directory in which to store the processed outputs.
Start the training process. Checkpoints are stored by default in the local directory checkpoints.
The script will automatically save a checkpoint when terminated with Ctrl + C.
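The save-on-interrupt behaviour can be sketched as a training loop that catches KeyboardInterrupt, which is what Python raises on Ctrl + C. This is a minimal illustration, not the code in train.py; the train_step and save_checkpoint callables are hypothetical placeholders.

```python
def train(num_steps, train_step, save_checkpoint):
    """Run train_step up to num_steps times, saving a checkpoint if interrupted."""
    step = 0
    try:
        while step < num_steps:
            train_step(step)
            step += 1
    except KeyboardInterrupt:
        # Ctrl + C raises KeyboardInterrupt in the main thread;
        # persist progress before exiting.
        save_checkpoint(step)
    return step
```

The function returns the number of completed steps, so a caller can tell whether training finished or stopped early.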
Example 1: Starting a new model for training
python train.py data_dir
data_dir is the directory containing the processed files.
Example 2: Restoring training from checkpoint
python train.py data_dir --checkpoint=checkpoints/checkpoint0010000.pth
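When resuming, it can be handy to pick the most recent checkpoint automatically. The helper below is a hypothetical sketch that is not part of this repository. It assumes checkpoint files follow the checkpointNNNNNNN.pth naming seen in the example above and that the number is the training step.

```python
import re
from pathlib import Path

# Assumed naming scheme, based on the example file checkpoint0010000.pth.
CHECKPOINT_PATTERN = re.compile(r"checkpoint(\d+)\.pth")


def checkpoint_step(path):
    """Return the step number encoded in a checkpoint file name, or None."""
    match = CHECKPOINT_PATTERN.fullmatch(Path(path).name)
    return int(match.group(1)) if match else None


def latest_checkpoint(checkpoint_dir="checkpoints"):
    """Return the checkpoint path with the highest step, or None if there is none."""
    candidates = [
        (checkpoint_step(p), p)
        for p in Path(checkpoint_dir).glob("checkpoint*.pth")
    ]
    candidates = [(step, p) for step, p in candidates if step is not None]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None
```

The returned path could then be passed to train.py through the --checkpoint option shown above.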
Evaluation .wav files and plots are saved in checkpoints/eval.
- optimize learning rate schedule
- optimize training hyperparameters (seq_len and batch_size)
- batch generation for synthesis speedup
- model pruning