JoinTTS

Brainless concatenative text to speech.

JoinTTS is a simple offline (on-premise) concatenative TTS API for Node.js.

jointts is the formal name of this project, but you can simply call it joint, which is also the command line program alias. Ah! There's a funny double meaning in the name. 🙄

Introduction

The goal is to build a very simple, efficient concatenative speech synthesizer that, at run-time, concatenates prerecorded local audio files without any cloud access.

The system is suitable for applications with a small grammar (a limited set of sentences/words) that need semi-static speech generation.

An example application is an embedded system whose TTS output consists mainly of fixed sentences with a small number of variable/dynamic parts, such as entities (codes, names) in template literals.

The target environment is therefore any kind of embedded system (on-premise/offline) with limited CPU resources that needs real-time, responsive speech output.

Speech is produced by concatenating prepared audio source files for letters, words, template literals, and entire phrases. All the required audio "chunks" are prepared offline so that they are available at run-time for fast concatenative audio generation.

The text-to-speech output is either audio files or in-memory binary blobs (Node.js buffers), in a specific audio codec such as PCM or OPUS.
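For raw (headerless) PCM, in-memory concatenation amounts to joining bytes, provided every chunk shares the same sample rate, channel count, and sample width. The following is a minimal sketch under that assumption (16-bit mono at 16 kHz). The helper names `concatPcm` and `pcmDurationMs` are hypothetical and are not part of the JoinTTS API.

```javascript
// Hypothetical helpers, not part of the JoinTTS API.
// Assumes all chunks are raw PCM in the same format (here 16-bit mono, 16 kHz).

/** Join raw PCM chunks into a single Node.js buffer. */
function concatPcm(chunks) {
  return Buffer.concat(chunks);
}

/** Duration in milliseconds of a raw PCM buffer. */
function pcmDurationMs(buffer, sampleRate = 16000, channels = 1, bytesPerSample = 2) {
  const bytesPerSecond = sampleRate * channels * bytesPerSample;
  return (buffer.length / bytesPerSecond) * 1000;
}

// Two silent chunks: 16000 bytes = 0.5 s and 8000 bytes = 0.25 s in this format.
const a = Buffer.alloc(16000);
const b = Buffer.alloc(8000);
const speech = concatPcm([a, b]);
console.log(speech.length, pcmDurationMs(speech)); // prints: 24000 750
```

Encoded formats such as OPUS generally cannot be joined this naively and need a tool such as ffmpeg.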

Audio recordings can be sourced in two alternative ways:

  • Real human voice recordings (by voice actors)

    This is especially useful, for example, in language education apps with special needs such as syllable pronunciation.

  • Synthetic voice recordings

    You can, for example, use Google Translate TTS, or any TTS of your choice, to prepare the speech files/buffers.

    💡 Note that using a cloud-based TTS to generate audio chunks is mostly a testing workaround for when real human voice recordings are not available. Please read the disclaimer section for details.

Multi language

Speech generation is language-dependent.

JoinTTS can be configured to handle many natural languages. See the Multi-language doc.

Segmentation

Input text can be handled at several levels: characters, words, phrases, and templates.

  • Static phrases
  • Words concatenation
  • Character-by-character spelling
  • Template literals

See Text segmentation doc.
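The segmentation strategies listed above could be sketched roughly as follows. This is not the JoinTTS implementation: the function names (`segment`, `segmentTemplate`, `spell`), the `${name}` placeholder syntax, and the sample phrase and word sets are all assumptions made for illustration.

```javascript
// Hypothetical sketch of text segmentation; not the JoinTTS implementation.
// Strategy: use a whole phrase if known, else go word by word, spelling unknown words.

const phrases = new Set(['your code is']);      // assumed static phrases
const words = new Set(['your', 'code', 'is']);  // assumed known words

/** Spell a token character by character. */
function spell(token) {
  return Array.from(token.toLowerCase()).map((ch) => ({ type: 'character', text: ch }));
}

/** Split input text into chunks, each of which would map to one audio file. */
function segment(text) {
  const normalized = text.trim().toLowerCase();
  if (phrases.has(normalized)) return [{ type: 'phrase', text: normalized }];

  return normalized.split(/\s+/).flatMap((token) =>
    words.has(token) ? [{ type: 'word', text: token }] : spell(token)
  );
}

/** Expand a template: fixed parts are segmented as text, entities are spelled. */
function segmentTemplate(template, entities) {
  return template.split(/(\$\{\w+\})/).filter(Boolean).flatMap((part) => {
    const match = part.match(/^\$\{(\w+)\}$/);
    return match ? spell(String(entities[match[1]])) : segment(part);
  });
}

console.log(segment('Code ABC'));
console.log(segmentTemplate('your code is ${code}', { code: 'A1' }));
```

Each resulting chunk names one prepared audio file, so the run-time cost is a lookup plus a concatenation.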

How does it work?

Step 1 - Build "language model" configurations

All required audio files are generated according to the settings in the configuration files, either from user voice recordings or from third-party (synthetic voice) sources to be downloaded.

The configuration files are:

  • characters.json
  • words.json
  • phrases.json
  • templates.json

They specify which audio file to use for each target concatenation.
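As a purely illustrative sketch, such a mapping could look like the object below. The real schema of `characters.json` is not described here, so the shape (text unit mapped to file path) and the helper `resolveFiles` are assumptions; only the `audio/it/*.mp3` paths follow the layout used in this document.

```javascript
// Purely illustrative: the real schema of characters.json is not described here.
// Assumed shape: each entry maps a text unit to the audio file that renders it.
const characters = {
  a: 'audio/it/a.mp3',
  b: 'audio/it/b.mp3',
  c: 'audio/it/c.mp3'
};

/** Resolve a sequence of text units to audio file paths, failing on unknown ones. */
function resolveFiles(config, units) {
  return units.map((unit) => {
    if (!Object.prototype.hasOwnProperty.call(config, unit)) {
      throw new Error(`no audio file configured for "${unit}"`);
    }
    return config[unit];
  });
}

console.log(resolveFiles(characters, ['a', 'b', 'c']));
```

Failing early on a missing entry makes gaps in the prepared audio set visible before anything is played.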

Configuration files are language-dependent:

  • config/it/*.json
  • config/en/*.json
  • config/de/*.json
  • ...
          +-------------------+
          |                   |
          |    joinTTS CLI    |
          |                   |
          +---------+---------+
                    |
          +---------v---------+
          |                   |
          | language grammar  |
          | config generator  |
          |                   |
          +---------+---------+
                    |
                    v
            config/it/*.json
            config/en/*.json
            config/de/*.json
                    |
                    v

Step 2 - Build speech audio files

Audio source files can be produced in two different ways:

  • 🎙 Human voice recordings

    For a personalized voice experience, a voice actor can record all required audio files.

    🛠 TOCOMPLETE

  • 🩹 Synthetic voice files

    Audio files are generated by a cloud-based TTS, such as Amazon Polly or Google Cloud Platform Text-to-Speech, and downloaded as files.

    JoinTTS uses the Google Translate speech library, purely as an example. With the jointts (or joint) command line utility, speech MP3 files (containing the Google Translate synthetic voice) can be generated from texts:

    $ jointts download gt
                                 +------------------+
                                 |                  |
                                 |    joinTTS CLI   |
                                 |                  |
                                 +---------+--------+
            config/it/*.json               |
            config/en/*.json               |
            config/de/*.json               |
                    | |          +---------v--------+
                    | +---------->                  |
                    |            |    audio files   |
                    |            |    production    |
                    |            |                  |
                    |            +--------+---------+
                    |                     |
                    |                     v
                    |              audio/it/a.mp3
                    |              audio/it/b.mp3
                    |              audio/it/c.mp3
                    |              ...
                    |                     |
                    v                     v

Step 3 - Run-time usage

At run-time, the main program calls the jointts run-time engine, which generates speech audio files on the fly by concatenating the available audio chunks.

           config/it/*.json                 
           config/en/*.json                 
           config/de/*.json                 
                   |               audio/it/a.mp3
                   |               audio/it/b.mp3
                   |               audio/it/c.mp3
                   |               ...
                   |                     |
         +---------v---------------------v----------+
         |                                          |
text --> |            joinTTS run-time API          | --> audio file
'ABC123' |                                          |     ABC123.mp3
         +------------------------------------------+
         |                  ffmpeg                  |
         +------------------------------------------+
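One way to join prepared audio chunks with ffmpeg is its concat demuxer, which reads a list file and can copy the streams without re-encoding. The sketch below shows that technique only; whether JoinTTS invokes ffmpeg this way is an assumption, and the function names `concatList`, `concatArgs`, and `concatenate` are hypothetical.

```javascript
// Hypothetical sketch using ffmpeg's concat demuxer; not the JoinTTS run-time API.
const { execFileSync } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

/** Build the list file content expected by ffmpeg's concat demuxer. */
function concatList(files) {
  // Paths are made absolute because relative entries are resolved against the list file.
  // A single quote inside a path is written as '\'' in the list file.
  return files
    .map((f) => `file '${path.resolve(f).replace(/'/g, "'\\''")}'`)
    .join('\n') + '\n';
}

/** Build ffmpeg arguments that join the listed files without re-encoding. */
function concatArgs(listFile, outputFile) {
  return ['-y', '-f', 'concat', '-safe', '0', '-i', listFile, '-c', 'copy', outputFile];
}

/** Concatenate audio chunks into outputFile (requires ffmpeg on the PATH). */
function concatenate(files, outputFile) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'jointts-'));
  const listFile = path.join(dir, 'list.txt');
  fs.writeFileSync(listFile, concatList(files));
  execFileSync('ffmpeg', concatArgs(listFile, outputFile), { stdio: 'ignore' });
}

// Show the command that would be run, without invoking ffmpeg.
console.log('ffmpeg ' + concatArgs('list.txt', 'ABC123.mp3').join(' '));
```

Stream copy (`-c copy`) only works well when all chunks share the same codec and parameters, which holds when every chunk is prepared by the same offline step.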

See the documentation:

  • function calls API
  • jointts command line program usage

Installation

  1. 📦 Install ffmpeg

ffmpeg is used as the backend engine for all audio file conversions, audio playback, and audio concatenation.

sudo apt install ffmpeg 

Optionally, to use the OPUS codec:

sudo apt install libopus0 opus-tools
  2. 📦 Install jointts

The package contains the jointts command line program, so you must install the npm package globally, in one of two ways.

Clone this GitHub repo and link it:

$ git clone https://github.com/solyarisoftware/jointts
$ cd jointts && npm link

Or install it from the npm registry:

$ npm install -g jointts

👂 Listen to audio rendering examples

Listen here to examples of spelling audio rendering for alphanumeric codes.

🛠 Status

WORK-IN-PROGRESS / DRAFT.

So far, the project is a proof of concept in a pre-alpha stage, with 60% of the features implemented. Smart high-level usage still has to be defined.

⚠️ Disclaimer

JoinTTS run-time usage is intended to run in a private environment. You are responsible for managing the privacy, permissions, and licenses of all your files.

If you use cloud-based TTS platforms (such as Amazon Polly, Google TTS, etc.) to download synthetic voice files in the preparation step, it's your responsibility not to break any license or copyright.

Likewise, if you use voice recordings of other people, please make sure you have permission to do so.

License

MIT (c) Giorgio Robino

