Implementation of Spear-TTS, a multi-speaker text-to-speech attention network, in PyTorch.
The text-to-semantic module built here will be used to condition SoundStorm.
- Stability for their generous sponsorships to work on and open source cutting edge artificial intelligence research
- Lucas Newman for completing the backtranslation portion, as well as beam search decoding!
- Lucas Newman for completing the final text to semantic transformer training code!
$ pip install spear-tts-pytorch
import torch

from audiolm_pytorch import HubertWithKmeans

from spear_tts_pytorch import (
    TextToSemantic,
    SemanticToTextDatasetGenerator,
    GeneratedAudioTextDataset,
    MockDataset
)

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert_base_ls960.pt',
    kmeans_path = './hubert_base_ls960_L9_km500.bin'
)

model = TextToSemantic(
    wav2vec = wav2vec,
    dim = 512,
    num_text_token_ids = 256,
    heads = 8,
    target_kv_heads = 2, # grouped query attention, for memory efficient decoding
    source_depth = 1,
    target_depth = 1
)

ds = MockDataset(10)

dataset_generator = SemanticToTextDatasetGenerator(
    model = model,
    dataset = ds,
    folder = './output_folder'
)

dataset_generator(max_length = 2)

generated_dataset = GeneratedAudioTextDataset(
    folder = './output_folder'
)

assert len(generated_dataset) == 10
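
The `heads = 8, target_kv_heads = 2` setting above turns on grouped query attention in the decoder. As a rough illustration of why that helps decoding memory, the sketch below shows how query heads could be assigned to shared key / value heads and how much smaller the key / value cache becomes. It is not the library's internal code: the helper names, the consecutive grouping of heads, and `dim_head = 64` are all assumptions made for the example.

```python
def kv_head_for_query_head(query_head: int, heads: int, kv_heads: int) -> int:
    """Return the index of the key/value head a given query head reads from.

    Assumes consecutive query heads share a key/value head, which is a
    common convention but not confirmed for this library.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    group_size = heads // kv_heads
    return query_head // group_size


def kv_cache_elements(seq_len: int, kv_heads: int, dim_head: int) -> int:
    """Cached elements for one layer: keys plus values at every position."""
    return 2 * seq_len * kv_heads * dim_head


heads, kv_heads = 8, 2  # matches heads / target_kv_heads in the usage above
dim_head = 64           # assumed value, for illustration only

# Each of the 8 query heads is mapped to one of the 2 key/value heads.
mapping = [kv_head_for_query_head(h, heads, kv_heads) for h in range(heads)]

# Compare the cache for full multi-head attention against the grouped variant.
full = kv_cache_elements(1024, heads, dim_head)
grouped = kv_cache_elements(1024, kv_heads, dim_head)

print(mapping)          # [0, 0, 0, 0, 1, 1, 1, 1]
print(full // grouped)  # 4
```

With these numbers, four query heads share each key / value head, so the cache kept during autoregressive decoding is a quarter of the multi-head size.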
- add eos logic + generate, and hook up end-to-end generation in soundstorm
- add first pretraining speech-to-speech with the reconstruction of 60% deleted tokens
- add dropouts for this project, as it targets a low-resource setting
- add total flexibility of which layers of encoder / decoder to freeze during training
- add step for training on small speech -> text corpus and generating pseudo-labelled dataset + finetuning (thanks to @lucasnewman)
- add final step of finetuning on text -> speech + pseudo-labelled dataset
- figure out the best way to store and manage the pseudo-labelled generated dataset
- batched beam search decoding
- allow for using rotary positions in decoder + flash attention, give Tri another citation
- integrate speculative decoding with some improvisation - done in same model using early exit strategy
- add cached key / values for starter + single / grouped key values, make sure flash attention can support specialized causal mask before flash attention 2 is in pytorch core
- polish the audio-text generation workflow
- concatenating the real audio-text dataset with the generated one -> or being able to convert real audio-text dataset to generated
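
The pretraining item in the list above (speech-to-speech reconstruction of 60% deleted tokens) can be pictured as a denoising task: semantic tokens are dropped at random and the model learns to emit the original sequence. The sketch below only builds one such (corrupted, target) pair in plain Python. The function name `delete_tokens`, the rule of keeping at least one token, and the use of 500 token ids (taken from the `km500` k-means file name) are assumptions for illustration, not this repository's training code.

```python
import random


def delete_tokens(tokens, delete_prob=0.6, rng=None):
    """Drop each token independently with probability `delete_prob`.

    The surviving tokens keep their original order. If every token would be
    dropped, one token is kept so the input is never empty (an assumption
    made for this sketch).
    """
    if rng is None:
        rng = random.Random()
    kept = [t for t in tokens if rng.random() >= delete_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept


rng = random.Random(0)  # seeded so the example is reproducible

# Stand-in for a sequence of semantic token ids from a 500-cluster k-means.
semantic_ids = [rng.randrange(500) for _ in range(1000)]

# Source is the corrupted sequence, target is the untouched original.
corrupted = delete_tokens(semantic_ids, delete_prob=0.6, rng=rng)
training_pair = (corrupted, semantic_ids)

print(len(semantic_ids), len(corrupted))
```

On average about 40% of the tokens survive, so the encoder sees a much shorter sequence than the one the decoder has to reconstruct.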
@misc{kharitonov2023speak,
title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
author = {Eugene Kharitonov and Damien Vincent and Zalán Borsos and Raphaël Marinier and Sertan Girgin and Olivier Pietquin and Matt Sharifi and Marco Tagliasacchi and Neil Zeghidour},
year = {2023},
eprint = {2302.03540},
archivePrefix = {arXiv},
primaryClass = {cs.SD}
}
@inproceedings{dao2022flashattention,
title = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
author = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
booktitle = {Advances in Neural Information Processing Systems},
year = {2022}
}
@misc{shi2023enhance,
title = {Enhance audio generation controllability through representation similarity regularization},
author = {Yangyang Shi and Gael Le Lan and Varun Nagaraja and Zhaoheng Ni and Xinhao Mei and Ernie Chang and Forrest Iandola and Yang Liu and Vikas Chandra},
year = {2023},
eprint = {2309.08773},
archivePrefix = {arXiv},
primaryClass = {cs.SD}
}
@article{Ainslie2023GQATG,
title = {GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints},
author = {Joshua Ainslie and James Lee-Thorp and Michiel de Jong and Yury Zemlyanskiy and Federico Lebr{\'o}n and Sumit K. Sanghai},
journal = {ArXiv},
year = {2023},
volume = {abs/2305.13245},
url = {https://api.semanticscholar.org/CorpusID:258833177}
}
@inproceedings{Leviathan2022FastIF,
title = {Fast Inference from Transformers via Speculative Decoding},
author = {Yaniv Leviathan and Matan Kalman and Y. Matias},
booktitle = {International Conference on Machine Learning},
year = {2022},
url = {https://api.semanticscholar.org/CorpusID:254096365}
}