Unified Speech and Gesture Synthesis Using Flow Matching

Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, and Gustav Eje Henter

This is the official code implementation of Unified Speech and Gesture Synthesis Using Flow Matching [ICASSP 2024].

We introduce a new method, Match-TTSG, for diffusion-like joint synthesis of speech and 3D gestures from text. Our main improvements are:

A new architecture that unifies speech and motion synthesis into one single pathway and decoder.
Training using flow matching, a.k.a. rectified flows.

Compared to the previous state of the art, our new method:

Improves speech and motion quality
Is smaller
Is 10 times faster
Generates speech and gestures that are a much better fit for each other

To our knowledge, this is the first method synthesising 3D motion using flow matching or rectified flows.

Check out our demo page and read our ICASSP 2024 paper for more details.

Installation

Create an environment (suggested but optional)

conda create -n matcha-tts python=3.10 -y
conda activate match-ttsg

Install from source

pip install git+https://github.com/shivammehta25/Match-TTSG.git
cd Match-TTSG
pip install -e .

Run CLI / gradio app / jupyter notebook

# This will download the required models
matcha-tts --text "<INPUT TEXT>"

matcha-tts-app

or open synthesis.ipynb on jupyter notebook

CLI Arguments

To synthesise from given text, run:

matcha-tts --text "<INPUT TEXT>"

To synthesise from a file, run:

matcha-tts --file <PATH TO FILE>

To batch synthesise from a file, run:

matcha-tts --file <PATH TO FILE> --batched

Additional arguments

Speaking rate

matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.0

Sampling temperature

matcha-tts --text "<INPUT TEXT>" --temperature 0.667

Euler ODE solver steps

matcha-tts --text "<INPUT TEXT>" --steps 10

Train with your own dataset

Let's assume we are training with LJ Speech

Download the dataset from here, extract it to data/LJSpeech-1.1, and prepare the file lists to point to the extracted data like for item 5 in the setup of the NVIDIA Tacotron 2 repo.
Clone and enter the Matcha-TTS repository

git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS

Install the package from source

pip install -e .

Go to configs/data/ljspeech.yaml and change

train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt

Generate normalisation statistics with the yaml file of dataset configuration

matcha-data-stats -i ljspeech.yaml
# Output:
#{'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}

Update these values in configs/data/ljspeech.yaml under data_statistics key.

data_statistics:  # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101

to the paths of your train and validation filelists.

Run the training script

make train-ljspeech

python matcha/train.py experiment=ljspeech

for a minimum memory run

python matcha/train.py experiment=ljspeech_min_memory

for multi-gpu training, run

python matcha/train.py experiment=ljspeech trainer.devices=[0,1]

Synthesise from the custom trained model

matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT>

ONNX support

Special thanks to @mush42 for implementing ONNX export and inference support in Matcha-TTS, which this project inherits.

It is possible to export Matcha checkpoints to ONNX, and run inference on the exported ONNX graph.

ONNX export

To export a checkpoint to ONNX, first install ONNX with

pip install onnx

then run the following:

python3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5

Optionally, the ONNX exporter accepts vocoder-name and vocoder-checkpoint arguments. This enables you to embed the vocoder in the exported graph and generate waveforms in a single run (similar to end-to-end TTS systems).

Note that n_timesteps is treated as a hyper-parameter rather than a model input. This means you should specify it during export (not during inference). If not specified, n_timesteps is set to 5.

Important: for now, torch>=2.1.0 is needed for export since the scaled_product_attention operator is not exportable in older versions. Until the final version is released, those who want to export their models must install torch>=2.1.0 manually as a pre-release.

ONNX Inference

To run inference on the exported model, first install onnxruntime using

pip install onnxruntime
pip install onnxruntime-gpu  # for GPU inference

then use the following:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs

You can also control synthesis parameters:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --temperature 0.4 --speaking_rate 0.9 --spk 0

To run inference on GPU, make sure to install onnxruntime-gpu package, and then pass --gpu to the inference command:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --gpu

If you exported only Matcha to ONNX, this will write mel-spectrogram as graphs and numpy arrays to the output directory. If you embedded the vocoder in the exported graph, this will write .wav audio files to the output directory.

If you exported only Matcha to ONNX, and you want to run a full TTS pipeline, you can pass a path to a vocoder model in ONNX format:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx

This will write .wav audio files to the output directory.

Citation information

If you use our code or otherwise find this work useful, please cite our paper:

@inproceedings{mehta2024matchttsg,
  author={Mehta, Shivam and Tu, Ruibo and Alexanderson, Simon and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  title={Unified Speech and Gesture Synthesis Using Flow Matching}, 
  year={2024},
  pages={8220-8224},
  doi={10.1109/ICASSP48485.2024.10445998}}

Acknowledgements

Since this code uses Lightning-Hydra-Template, you have all the powers that come with it.

Other source code we would like to acknowledge:

Coqui-TTS: For helping me figure out how to make cython binaries pip installable and encouragement
Hugging Face Diffusers: For their awesome diffusers library and its components
Grad-TTS: For the monotonic alignment search source code
torchdyn: Useful for trying other ODE solvers during research and development
labml.ai: For the RoPE implementation