This repo contains recipes and models for training Pheme TTS models. It is the official implementation of the paper: Pheme: Efficient and Conversational Speech Generation. A demo is available here, while a selection of audio samples can be found here.
Our Pheme TTS framework validates several hypotheses:
- We can train Transformer-based conversational TTS models with substantially less training data than models such as VALL-E or SoundStorm (roughly 10x less data).
- Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
- Efficiency is paramount: parameter efficiency (compact models), data efficiency (less training data), and inference efficiency (reduced latency).
- One fundamental ingredient is the separation of semantic and acoustic tokens, together with an adequate speech tokenizer.
- Inference can be run in parallel through MaskGit-style decoding, with 15x speed-ups compared to similarly sized autoregressive models (see the sketch after this list).
- The single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
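For intuition, here is a minimal sketch of MaskGit-style confidence-based parallel decoding (a greedy variant, not the repo's actual implementation; the `model` callable, its signature, and the cosine schedule are illustrative assumptions):

```python
import math
import torch

def maskgit_decode(model, seq_len, mask_id, steps=16):
    """Iteratively unmask the most confident predictions in parallel.
    `model(tokens)` is assumed to return logits of shape (1, seq_len, vocab)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for t in range(steps):
        conf, pred = model(tokens).softmax(-1).max(-1)  # per-position confidence
        masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~masked, -1.0)          # never revisit fixed tokens
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        n_unmask = int(masked.sum()) - keep_masked
        if n_unmask > 0:
            idx = conf.topk(n_unmask, dim=-1).indices
            tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```

All positions start masked; each step fixes the most confident predictions in parallel, so a full sequence is produced in `steps` forward passes instead of one pass per token.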
Set up the conda environment:

```bash
conda create --name pheme3 python=3.10
conda activate pheme3

pip3 install torch torchvision torchaudio
pip3 install -r requirements.txt --no-deps
```
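Optionally, verify the installation before proceeding:

```python
import torch

# Confirm the PyTorch build and whether a CUDA device is visible.
print(torch.__version__)
print(torch.cuda.is_available())
```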
Download the pre-trained SpeechTokenizer and the unique token list:

```bash
st_dir="ckpt/speechtokenizer/"
mkdir -p ${st_dir}
cd ${st_dir}
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/SpeechTokenizer.pt"
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/config.json"
cd ..
wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_tokens.k2symbols"
```
You need to create a Hugging Face access token to use the pyannote speaker embedding model:

```bash
export HUGGING_FACE_HUB_TOKEN=YOUR_PRIVATE_TOKEN
```
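A minimal sketch of extracting a speaker embedding with pyannote (assuming a recent `pyannote.audio` and that your token has access to the model; the audio path is illustrative):

```python
from pyannote.audio import Inference, Model

model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token="YOUR_PRIVATE_TOKEN")
inference = Inference(model, window="whole")  # one embedding for the whole file
embedding = inference("datasets/example/audios/LJ001-0051.wav")
print(embedding.shape)
```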
Download the pre-trained T2S and S2A models (the 100M Pheme variant):

```bash
git clone https://huggingface.co/PolyAI/pheme_small ckpt/pheme
mkdir -p "ckpt/t2s"
mkdir -p "ckpt/s2a"
mv ckpt/pheme/config_t2s.json ckpt/t2s/config.json
mv ckpt/pheme/generation_config.json ckpt/t2s/generation_config.json
mv ckpt/pheme/t2s.bin ckpt/t2s/pytorch_model.bin
mv ckpt/pheme/config_s2a.json ckpt/s2a/config.json
mv ckpt/pheme/s2a.ckpt ckpt/s2a/s2a.ckpt
```

or use the larger (300M) variant from https://huggingface.co/PolyAI/pheme.
Generation can be invoked with:

```bash
python transformer_infer.py
```
The package requires data in the format of `datasets/example/train.json`, with the corresponding wav files stored under `datasets/example/audios/`.
The manifest should follow this format:

```json
{
  "LJ001-0051.wav": {
    "text": "and paying great attention to the press work or actual process of printing,",
    "raw-text": "and paying great attention to the press work or actual process of printing,",
    "duration": 4.860090702947846,
    "phoneme": "æ|n|d|_|p|eɪ|ɪ|ŋ|_|ɡ|ɹ|eɪ|t|_|ɐ|t|ɛ|n|ʃ|ə|n|_|t|ə|_|ð|ə|_|\"|p|ɹ|ɛ|s|_|w|ɜː|k|\"|_|ɔː|ɹ|_|æ|k|tʃ|uː|əl|_|p|ɹ|ɑː|s|ɛ|s|_|ʌ|v|_|p|ɹ|ɪ|n|t|ɪ|ŋ|,"
  },
  "LJ001-0120.wav": {
    ...
  },
  ...
}
```
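A hypothetical helper for assembling such a manifest is sketched below; the phonemization uses the `phonemizer` package with an espeak backend and illustrative separator settings, which may differ from the front-end used to produce the example above:

```python
import json
from pathlib import Path

import torchaudio
from phonemizer import phonemize
from phonemizer.separator import Separator

def build_manifest(audio_dir, transcripts, out_path):
    """Build a manifest; `transcripts` maps wav file names to their text."""
    manifest = {}
    for name, text in transcripts.items():
        info = torchaudio.info(str(Path(audio_dir) / name))
        manifest[name] = {
            "text": text,
            "raw-text": text,
            "duration": info.num_frames / info.sample_rate,
            # Separator choices are assumptions; verify against the example entry.
            "phoneme": phonemize(text, language="en-us", backend="espeak",
                                 separator=Separator(phone="|", word="_"),
                                 strip=True),
        }
    Path(out_path).write_text(json.dumps(manifest, indent=2, ensure_ascii=False))
```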
The following command will create semantic and acoustic tokens based on the audios folder:

```bash
python utils/get_tokens_speech_tokenizer.py \
  --config_path ckpt/speechtokenizer/config.json \
  --ckpt_path ckpt/speechtokenizer/SpeechTokenizer.pt \
  --encoding_input datasets/example/audios \
  --encoding_output datasets/example/audios-speech-tokenizer
```
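The semantic/acoustic split follows SpeechTokenizer's design, where the first RVQ level carries semantic content and the remaining levels carry acoustic detail. For intuition, a minimal sketch following the documented fnlp/SpeechTokenizer usage (the file path is illustrative):

```python
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

model = SpeechTokenizer.load_from_checkpoint(
    "ckpt/speechtokenizer/config.json",
    "ckpt/speechtokenizer/SpeechTokenizer.pt",
)
model.eval()

# Load a mono wav and resample to the tokenizer's rate (16 kHz).
wav, sr = torchaudio.load("datasets/example/audios/LJ001-0051.wav")
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

with torch.no_grad():
    codes = model.encode(wav.unsqueeze(0))  # (n_q, batch, time)

semantic = codes[:1]   # first RVQ level: semantic tokens
acoustic = codes[1:]   # remaining levels: acoustic detail
```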
Train the T2S (text-to-semantic) model:

```bash
python train_t2s.py --metapath datasets/example/train.json \
  --val_metapath datasets/example/train.json \
  --output_dir ~/experiments/t2s \
  --model_size tiny --batch_size 16 \
  --nworkers 12 --warmup_steps 10000 \
  --save_steps 500 --n_epochs 10
```
Train the S2A (semantic-to-acoustic) model:

```bash
python train_s2a.py --saving_path exp/a2s --sampledir exp/a2s --vocoder_type SPEECHTOKENIZER \
  --n_codes 1024 --n_cluster_groups 7 --metapath datasets/example/train.json \
  --val_metapath datasets/example/train.json \
  --warmup_step 10000 --nworkers 12 --first_n_lvls 7 \
  --batch_size 1 --ffd_size 512 --hidden_size 512 --enc_nlayers 1 --nheads 8 \
  --depthwise_conv_kernel_size 5 \
  --val_check_interval 1 --sample_rate 16000 --lr 5e-4 \
  --check_val_every_n_epoch 1 --n_semantic_codes 1024 \
  --distributed
```
Model | Batch Size | Steps | RTF (ms)
---|---|---|---
T2S-S2A Short sentence | 1 | 16 | 0.133
T2S-S2A Long sentence | 1 | 16 | 0.133

Model | Batch Size | Steps | RTF (ms)
---|---|---|---
T2S-S2A Short sentence | 1 | 16 | 0.143
T2S-S2A Long sentence | 1 | 16 | 0.143
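For reference, the real-time factor is synthesis time divided by the duration of the generated audio, so values below 1 mean faster-than-real-time generation. A minimal sketch of measuring it, where `synthesize` is a hypothetical function returning a waveform tensor:

```python
import time

def measure_rtf(synthesize, text, sample_rate=16000):
    start = time.perf_counter()
    wav = synthesize(text)  # hypothetical TTS call returning a (..., time) tensor
    elapsed = time.perf_counter() - start
    return elapsed / (wav.shape[-1] / sample_rate)
```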
This codebase builds on the following projects:
- MQTTS
- SpeechTokenizer
- maskgit
- SoundStorm
TODO:
- Add TensorRT-LLM image
If you use this code or components of the model in your own work, please cite our work as:

```
@misc{budzianowski2024pheme,
      title={Pheme: Efficient and Conversational Speech Generation},
      author={Paweł Budzianowski and Taras Sereda and Tomasz Cichy and Ivan Vulić},
      year={2024},
      eprint={2401.02839},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```