OptiSpeech: Lightweight End-to-End text-to-speech model

OptiSpeech is ment to be an efficient, lightweight and fast text-to-speech model for on-device text-to-speech.

I would like to thank Pneuma Solutions for providing GPU resources for training this model. Their support significantly accelerated my development process.

Audio sample

optispeech-mike.mp4

optispeech-demo.mp4

Note that this is still WIP. Final model designed decisions are still being made.

Installation

Inference-only

If you want an inference-only minimum -dependency package that doesn't require pytorch, you can use ospeech

Training and development

We use uv to manage the python runtime and dependencies.

Install uv first, then run the following:

$ git clone https://github.com/mush42/optispeech
$ cd optispeech
$ uv sync

Inference

Command line API

$ python3 -m optispeech.infer  --help
usage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]
                checkpoint text output_dir

Speaking text using OptiSpeech

positional arguments:
  checkpoint           Path to OptiSpeech checkpoint
  text                 Text to synthesise
  output_dir           Directory to write generated audio to.

options:
  -h, --help           show this help message and exit
  --d-factor D_FACTOR  Scale to control speech rate
  --p-factor P_FACTOR  Scale to control pitch
  --e-factor E_FACTOR  Scale to control energy
  --cuda               Use GPU for inference

Python API

import soundfile as sf
from optispeech.model import OptiSpeech

# Load model
device = torch.device("cpu")
ckpt_path = "/path/to/checkpoint"
model = OptiSpeech.load_from_checkpoint(ckpt_path, map_location="cpu")
model = model.to(device)
model = model.eval()

# Text preprocessing and phonemization
sentence = "A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets resulting in a spectrum of light appearing in the sky."
inference_inputs = model.prepare_input(sentence)
inference_outputs = model.synthesize(inference_inputs)

inference_outputs = inference_outputs.as_numpy()
wav = inference_outputs.wav
sf.write("output.wav", wav.squeeze(), model.sample_rate)

Training

Since this code uses Lightning-Hydra-Template, you have all the powers that come with it.

Training is easy as 1, 2, 3:

1. Prepare Dataset

Given a dataset that is organized as follows:

├── train
│   ├── metadata.csv
│   └── wav
│       ├── aud-00001-0003.wav
│       └── ...
└── val
    ├── metadata.csv
    └── wav
        ├── aud-00764.wav
        └── ...

The metadata.csv file can contain 2, 3 or 4 columns delimited by | (bar character) in one of the following formats:

2 columns: file_id|text
3 columns: file_id|speaker_id|text
4 columns: file_id|speaker_id|language_id|text

Use the preprocess_dataset script to prepare the dataset for training:

$ python3 -m optispeech.tools.preprocess_dataset --help
usage: preprocess_dataset.py [-h] [--format {ljspeech}] dataset input_dir output_dir

positional arguments:
  dataset              dataset config relative to `configs/data/` (without the suffix)
  input_dir            original data directory
  output_dir           Output directory to write datafiles + train.txt and val.txt

options:
  -h, --help           show this help message and exit
  --format {ljspeech}  Dataset format.

If you are training on a new dataset, you must calculate and add **data_statistics ** using the following script:

$ python3 -m optispeech.tools.generate_data_statistics --help
usage: generate_data_statistics.py [-h] [-b BATCH_SIZE] [-f] [-o OUTPUT_DIR] input_config

positional arguments:
  input_config          The name of the yaml config file under configs/data

options:
  -h, --help            show this help message and exit
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        Can have increased batch size for faster computation
  -f, --force           force overwrite the file
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory to save the data statistics

2. [Optional] Choose your backbone

OptiSpeech provides interchangeable types of backbones for the model's encoder and decoder, you can choose the backbone based on your target performance profile.

To help you choose, here's a quick computational-complexity analysis of the available backbones:

Backbone	Config File	FLOPs	MACs	#Params
ConvNeXt	`optispeech.yaml`	10.57 GFLOPS	5.27 GMACs	15.89 M
Light	`light.yaml`	7.88 GFLOPS	3.93 GMACs	10.74 M
Transformer	`transformer.yaml`	14.15 GFLOPS	7.06 GMACs	17.98 M
Conformer	`conformer.yaml`	20.42 GFLOPS	10.19 GMACs	24.35 M

The default backbone is ConvNeXt, but if you want to change it you can edit your experiment config.

3. Start training

To start training run the following command. Note that this training run uses config from hfc_female-en_US. You can copy and update it with your own config values, and pass the name of the custom config file (without extension) instead.

$ python3 -m optispeech.train experiment=hfc_female-en_us

ONNX support

ONNX export

$ python3 -m optispeech.onnx.export --help
usage: export.py [-h] [--opset OPSET] [--seed SEED] checkpoint_path output

Export OptiSpeech checkpoints to ONNX

positional arguments:
  checkpoint_path  Path to the model checkpoint
  output           Path to output `.onnx` file

options:
  -h, --help       show this help message and exit
  --opset OPSET    ONNX opset version to use (default 15
  --seed SEED      Random seed

ONNX inference

$ python3 -m optispeech.onnx.infer --help
usage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]
                onnx_path text output_dir

ONNX inference of OptiSpeech

positional arguments:
  onnx_path            Path to the exported LeanSpeech ONNX model
  text                 Text to speak
  output_dir           Directory to write generated audio to.

options:
  -h, --help           show this help message and exit
  --d-factor D_FACTOR  Scale to control speech rate.
  --p-factor P_FACTOR  Scale to control pitch.
  --e-factor E_FACTOR  Scale to control energy.
  --cuda               Use GPU for inference

Acknowledgements

Repositories I would like to acknowledge:

BetterFastspeech2: For repo backbone
LightSpeech: for the transformer backbone
JETS: for the phoneme-mel alignment framework
Vocos: For pioneering the use of ConvNext in TTS
Piper-TTS: For leading the charge in on-device TTS. Also for the great phonemizer

Reference

@inproceedings{luo2021lightspeech,
    title={Lightspeech: Lightweight and fast text to speech with neural architecture search},
    author={Luo, Renqian and Tan, Xu and Wang, Rui and Qin, Tao and Li, Jinzhu and Zhao, Sheng and Chen, Enhong and Liu, Tie-Yan},
    booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    pages={5699--5703},
    year={2021},
    organization={IEEE}
}

@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}

@INPROCEEDINGS{10446890,
  author={Okamoto, Takuma and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Convnext-TTS And Convnext-VC: Convnext-Based Fast End-To-End Sequence-To-Sequence Text-To-Speech And Voice Conversion},
  year={2024},
  volume={},
  number={},
  pages={12456-12460},
  keywords={Vocoders;Neural networks;Signal processing;Transformers;Real-time systems;Acoustics;Decoding;ConvNeXt;JETS;text-to-speech;voice conversion;WaveNeXt},
  doi={10.1109/ICASSP48485.2024.10446890}
}

mush42/optispeech