This repo contains recipes and models for training Pheme TTS models. It is the official implementation of the paper: Pheme: Efficient and Conversational Speech Generation. A demo is available here, while a selection of audio samples can be found here.
Our Pheme TTS framework validates several hypotheses:
- We can train Transformer-based conversational TTS models with substantially less training data than models such as VALL-E or SoundStorm (roughly 10x less data).
- Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
- Efficiency is paramount: parameter efficiency (compact models), data efficiency (less training data), and inference efficiency (reduced latency).
- One fundamental ingredient is the separation of semantic and acoustic tokens, together with an adequate speech tokenizer.
- Inference can be run in parallel through MaskGit-style decoding, with 15x speed-ups compared to similarly sized autoregressive models (see the sketch after this list).
- The single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
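For intuition, here is a minimal sketch of MaskGit-style confidence-based parallel decoding (a greedy variant, not the repo's actual implementation; the `model` callable, its signature, and the cosine schedule are illustrative assumptions):

```python
import math
import torch

def maskgit_decode(model, seq_len, mask_id, steps=16):
    """Iteratively unmask the most confident predictions in parallel.
    `model(tokens)` is assumed to return logits of shape (1, seq_len, vocab)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for t in range(steps):
        conf, pred = model(tokens).softmax(-1).max(-1)  # per-position confidence
        masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~masked, -1.0)          # never revisit fixed tokens
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        n_unmask = int(masked.sum()) - keep_masked
        if n_unmask > 0:
            idx = conf.topk(n_unmask, dim=-1).indices
            tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```

All positions start masked; each step fixes the most confident predictions in parallel, so a full sequence is produced in `steps` forward passes instead of one pass per token.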
Set up the conda environment:

```bash
conda create --name pheme3 python=3.10
conda activate pheme3

pip3 install torch torchvision torchaudio
pip3 install -r requirements.txt --no-deps
```
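Optionally, verify the installation before proceeding:

```python
import torch

# Confirm the PyTorch build and whether a CUDA device is visible.
print(torch.__version__)
print(torch.cuda.is_available())
```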
Download the pre-trained SpeechTokenizer and the unique token list:

```bash
st_dir="ckpt/speechtokenizer/"
mkdir -p ${st_dir}
cd ${st_dir}
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/SpeechTokenizer.pt"
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/config.json"
cd ..
wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_tokens.k2symbols"
```
You need to create a Hugging Face access token to use the pyannote speaker embedding model:

```bash
export HUGGING_FACE_HUB_TOKEN=YOUR_PRIVATE_TOKEN
```
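A minimal sketch of extracting a speaker embedding with pyannote (assuming a recent `pyannote.audio` and that your token has access to the model; the audio path is illustrative):

```python
from pyannote.audio import Inference, Model

model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token="YOUR_PRIVATE_TOKEN")
inference = Inference(model, window="whole")  # one embedding for the whole file
embedding = inference("datasets/example/audios/LJ001-0051.wav")
print(embedding.shape)
```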
Download the pre-trained T2S and S2A models (the 100M Pheme variant):

```bash
git clone https://huggingface.co/PolyAI/pheme_small ckpt/pheme
mkdir -p "ckpt/t2s"
mkdir -p "ckpt/s2a"
mv ckpt/pheme/config_t2s.json ckpt/t2s/config.json
mv ckpt/pheme/generation_config.json ckpt/t2s/generation_config.json
mv ckpt/pheme/t2s.bin ckpt/t2s/pytorch_model.bin
mv ckpt/pheme/config_s2a.json ckpt/s2a/config.json
mv ckpt/pheme/s2a.ckpt ckpt/s2a/s2a.ckpt
```

or use the larger (300M) variant from https://huggingface.co/PolyAI/pheme.
Generation can be invoked with:

```bash
python transformer_infer.py
```
The package requires data in the format of `datasets/example/train.json`, with the corresponding wav files stored under `datasets/example/audios/`.
The manifest should follow this format:

```json
{
  "LJ001-0051.wav": {
    "text": "and paying great attention to the press work or actual process of printing,",
    "raw-text": "and paying great attention to the press work or actual process of printing,",
    "duration": 4.860090702947846,
    "phoneme": "æ|n|d|_|p|eɪ|ɪ|ŋ|_|ɡ|ɹ|eɪ|t|_|ɐ|t|ɛ|n|ʃ|ə|n|_|t|ə|_|ð|ə|_|\"|p|ɹ|ɛ|s|_|w|ɜː|k|\"|_|ɔː|ɹ|_|æ|k|tʃ|uː|əl|_|p|ɹ|ɑː|s|ɛ|s|_|ʌ|v|_|p|ɹ|ɪ|n|t|ɪ|ŋ|,"
  },
  "LJ001-0120.wav": {
    ...
  },
  ...
}
```
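A hypothetical helper for assembling such a manifest is sketched below; the phonemization uses the `phonemizer` package with an espeak backend and illustrative separator settings, which may differ from the front-end used to produce the example above:

```python
import json
from pathlib import Path

import torchaudio
from phonemizer import phonemize
from phonemizer.separator import Separator

def build_manifest(audio_dir, transcripts, out_path):
    """Build a manifest; `transcripts` maps wav file names to their text."""
    manifest = {}
    for name, text in transcripts.items():
        info = torchaudio.info(str(Path(audio_dir) / name))
        manifest[name] = {
            "text": text,
            "raw-text": text,
            "duration": info.num_frames / info.sample_rate,
            # Separator choices are assumptions; verify against the example entry.
            "phoneme": phonemize(text, language="en-us", backend="espeak",
                                 separator=Separator(phone="|", word="_"),
                                 strip=True),
        }
    Path(out_path).write_text(json.dumps(manifest, indent=2, ensure_ascii=False))
```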
The following command will create semantic and acoustic tokens based on the audios folder:

```bash
python utils/get_tokens_speech_tokenizer.py \
  --config_path ckpt/speechtokenizer/config.json \
  --ckpt_path ckpt/speechtokenizer/SpeechTokenizer.pt \
  --encoding_input datasets/example/audios \
  --encoding_output datasets/example/audios-speech-tokenizer
```
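The semantic/acoustic split follows SpeechTokenizer's design, where the first RVQ level carries semantic content and the remaining levels carry acoustic detail. For intuition, a minimal sketch following the documented fnlp/SpeechTokenizer usage (the file path is illustrative):

```python
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

model = SpeechTokenizer.load_from_checkpoint(
    "ckpt/speechtokenizer/config.json",
    "ckpt/speechtokenizer/SpeechTokenizer.pt",
)
model.eval()

# Load a mono wav and resample to the tokenizer's rate (16 kHz).
wav, sr = torchaudio.load("datasets/example/audios/LJ001-0051.wav")
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

with torch.no_grad():
    codes = model.encode(wav.unsqueeze(0))  # (n_q, batch, time)

semantic = codes[:1]   # first RVQ level: semantic tokens
acoustic = codes[1:]   # remaining levels: acoustic detail
```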
Train the T2S (text-to-semantic) model:

```bash
python train_t2s.py --metapath datasets/example/train.json \
  --val_metapath datasets/example/train.json \
  --output_dir ~/experiments/t2s \
  --model_size tiny --batch_size 16 \
  --nworkers 12 --warmup_steps 10000 \
  --save_steps 500 --n_epochs 10
```
Train the S2A (semantic-to-acoustic) model:

```bash
python train_s2a.py --saving_path exp/a2s --sampledir exp/a2s --vocoder_type SPEECHTOKENIZER \
  --n_codes 1024 --n_cluster_groups 7 --metapath datasets/example/train.json \
  --val_metapath datasets/example/train.json \
  --warmup_step 10000 --nworkers 12 --first_n_lvls 7 \
  --batch_size 1 --ffd_size 512 --hidden_size 512 --enc_nlayers 1 --nheads 8 \
  --depthwise_conv_kernel_size 5 \
  --val_check_interval 1 --sample_rate 16000 --lr 5e-4 \
  --check_val_every_n_epoch 1 --n_semantic_codes 1024 \
  --distributed
```
Model | Batch Size | Steps | RTF (ms)
---|---|---|---
T2S-S2A Short sentence | 1 | 16 | 0.133
T2S-S2A Long sentence | 1 | 16 | 0.133

Model | Batch Size | Steps | RTF (ms)
---|---|---|---
T2S-S2A Short sentence | 1 | 16 | 0.143
T2S-S2A Long sentence | 1 | 16 | 0.143
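For reference, the real-time factor is synthesis time divided by the duration of the generated audio, so values below 1 mean faster-than-real-time generation. A minimal sketch of measuring it, where `synthesize` is a hypothetical function returning a waveform tensor:

```python
import time

def measure_rtf(synthesize, text, sample_rate=16000):
    start = time.perf_counter()
    wav = synthesize(text)  # hypothetical TTS call returning a (..., time) tensor
    elapsed = time.perf_counter() - start
    return elapsed / (wav.shape[-1] / sample_rate)
```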
This codebase builds on the following projects:
- MQTTS
- SpeechTokenizer
- maskgit
- SoundStorm
TODO:
- Add TensorRT-LLM image
If you use this code or components of the model in your own work, please cite our work as:

```
@misc{budzianowski2024pheme,
      title={Pheme: Efficient and Conversational Speech Generation},
      author={Paweł Budzianowski and Taras Sereda and Tomasz Cichy and Ivan Vulić},
      year={2024},
      eprint={2401.02839},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```