This is a toy implementation of Transformer TTS: https://arxiv.org/pdf/1809.08895.pdf
Main simplifications:
- no tokenizer
- no scaled positional encoding
- no vocoder; audio is reconstructed with Griffin-Lim only
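For context on the positional-encoding simplification: the paper multiplies the sinusoidal positional encoding by a trainable weight before adding it to the prenet output, whereas the unscaled form adds it as is. A minimal pure-Python sketch of the difference follows. The function names and the `alpha` argument are illustrative only and are not taken from this repository, which may implement its encoding differently.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position (even dims: sin, odd dims: cos)."""
    pe = []
    for i in range(d_model):
        # Each sin/cos pair shares one frequency, indexed by i // 2.
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def add_position(x, position, alpha=1.0):
    """Add alpha * PE(position) to a feature vector x.

    alpha=1.0 gives the plain (unscaled) form. In the paper alpha is a
    trainable scalar; here it is an ordinary float for illustration.
    """
    pe = positional_encoding(position, len(x))
    return [xi + alpha * pi for xi, pi in zip(x, pe)]
```

With `alpha` fixed at 1.0 the two forms coincide, which is the case a toy implementation effectively hard-codes.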
Training time: ~1 day on a V100
- hello_world: https://on.soundcloud.com/orgCk
- london is the capital: https://on.soundcloud.com/YMZzC
- Weights: train_SimpleTransfromerTTS.pt
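import IPython
import matplotlib.pyplot as plt  # needed for the gate plot below
import torch
from model import TransformerTTS
from melspecs import inverse_mel_spec_to_wav
from write_mp3 import write_mp3
# Note: hp (the hyperparameters, see hyperparams.py) and text_to_seq are used
# below but are not imported here; import them from the modules that define them.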
train_saved_path = "<your-path>/train_SimpleTransfromerTTS.pt"
state = torch.load(train_saved_path)
model = TransformerTTS().cuda()
model.load_state_dict(state["model"])
text = "Hello, World."
name_file = "hello_world.mp3"
postnet_mel, gate = model.inference(
    text_to_seq(text).unsqueeze(0).cuda(),
    gate_threshold=1e-5,
    with_tqdm=False,
)
audio = inverse_mel_spec_to_wav(postnet_mel.detach()[0].T)
# Plot the sigmoid of the gate output for each decoded frame
plt.plot(
    torch.sigmoid(gate[0, :]).detach().cpu().numpy()
)
# Save the waveform as an MP3, then play it inline in the notebook
write_mp3(
    audio.detach().cpu().numpy(),
    name_file
)
IPython.display.Audio(
    audio.detach().cpu().numpy(),
    rate=hp.sr
)
- Download wavs
- Download metafiles
- Edit the paths csv_path, wav_path, save_path, and log_path in hyperparams.py
- Run python train.py
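Before starting a day-long run, it can help to confirm that the edited paths actually exist. A minimal sketch, assuming the four settings are plain path strings: the helper name `missing_paths` and the way the values are passed in are hypothetical, so adapt it to however hyperparams.py exposes them.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, value in paths.items() if not Path(value).exists()]


if __name__ == "__main__":
    # Placeholder values: replace them with the ones set in hyperparams.py.
    settings = {
        "csv_path": "<your-csv-path>",
        "wav_path": "<your-wav-path>",
        "save_path": "<your-save-path>",
        "log_path": "<your-log-path>",
    }
    for name in missing_paths(settings):
        print(f"{name} does not exist: {settings[name]}")
```

The output locations (save_path, log_path) may be created by the training script itself, so a missing entry there is not necessarily an error.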