rinigus/pure-maps

Alternative & new OSS-TTS-engines

Olf0 opened this issue · 4 comments

Olf0 commented

New contenders: RHVoice & eSpeakNG

I was reminded of the side-tracked discussion about alternative, maintained OSS-TTS-engines when I came across these two TTS engines on F-Droid ([1], [2]).

Both are maintained, but eSpeakNG may output a low-quality voice, as most improvements relative to the original eSpeak ([1], [2], [3]) did not address the voice engine proper.
OTOH, eSpeakNG now also has contributed Python bindings, and the listening samples of the eSpeakNG-based Mimic3 are fine. eSpeakNG is well documented: [1] & [2]
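For a quick local listening test of eSpeakNG, a minimal sketch could simply call the `espeak-ng` command-line tool from Python rather than the contributed bindings (whose API is not described here). It assumes `espeak-ng` is installed and on `PATH`; the helper names `build_espeak_command` and `synthesize` are made up for this illustration, and only the `-v`, `-s` and `-w` options of the CLI are used.

```python
import shutil
import subprocess


def build_espeak_command(text, voice="en", wav_path="out.wav", words_per_minute=160):
    """Return the argument list for an espeak-ng call that writes a WAV file.

    Hypothetical helper for illustration; not part of eSpeakNG or Pure Maps.
    """
    return [
        "espeak-ng",
        "-v", voice,                   # voice / language code, e.g. "en" or "pl"
        "-s", str(words_per_minute),   # speaking rate in words per minute
        "-w", wav_path,                # write a WAV file instead of playing audio
        text,
    ]


def synthesize(text, **kwargs):
    """Run espeak-ng if it is installed; return the WAV path, or None if it is missing."""
    cmd = build_espeak_command(text, **kwargs)
    if shutil.which(cmd[0]) is None:
        return None  # espeak-ng is not on PATH
    subprocess.run(cmd, check=True)
    return cmd[cmd.index("-w") + 1]
```

Calling `synthesize("Turn left in 200 metres", voice="en", wav_path="left.wav")` would then produce a file that can be compared by ear with the output of the other engines.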

Compared to eSpeak (and maybe also eSpeakNG), RHVoice seems to provide higher-quality voice synthesis and to cover a set of languages which often lack good-quality speech synthesis. RHVoice is documented in multiple languages.

flite

Then I set out to look up descendants of CMU's flite (Carnegie Mellon University's Festival-lite vox (voice encoder)), so I first looked for flite proper: latest source code v2.1 - v2.3+, original source code (alternative site) v1.0 - v2.1.0, research context and Festival vox documentation, original webpage with slide deck and scientific paper (alternative site, direct link to paper as HTML pages and PostScript file).
Side note: It is interesting to still see occasional commits (as of 2022-08-23, the latest one was on 2022-05-16) and releases (2.2 on 2020-08-13 and 2.3 in "March 2022", but untagged, hence better to consider the master branch as of 2022-05-16 as the "flite 2.3+" release) for the original flite.

mimic 1, 2 & 3

Then I looked at the well-known flite-based mimic. As we knew, its first incarnation ceased development with release v1.3.0.1 (also the state of the master branch since 2020-03-06), with a few additional bug-fixes in the development branch. It works well (solely for English) and is well documented ([1] & [2]).

@rinigus had hopes for mimic2 (see also), which died quickly in 2020, despite having the cool ability to deploy one's own voice.
Furthermore, it was "designed to run in the cloud", which likely means that it uses a lot of resources when the server and client components are running on a single machine.
Mimic2 is also very well documented: [1] & [2]

Mimic-3 is now Mycroft's focus and seems to be developed well, but it has had only a single proper release so far (v0.2.3) and uses libespeak-ng1.
It is also supported by the cool Mimic Recording Studio to record one's own voice.
Mimic-3 is nicely documented, too: [1] and [2]
Still, it needs to be analysed which functions mimic-3 provides over direct use of libespeak-ng, and evaluated whether these are worth the additional dependency (technically and WRT sustainability).

Interestingly, Mycroft's top-level mimic documentation page provides listening samples on which the libespeak-ng1-based mimic3 output sounds quite good.

FreeTTS

FreeTTS is also flite-based, but written in Java. Furthermore, its la(te)st release is v1.2.2 from 2009-03-09 and its la(te)st commit to SVN-trunk happened on 2012-05-08. Thus it is not worth pursuing.

NanoTTS, a command-line front-end for PicoTTS

NanoTTS ceased development in 2019, while the la(te)st commit to PicoTTS happened on 2018-02-14 (there are many downstream packages, e.g., this one). Like mimic1, NanoTTS and PicoTTS are clearly EOLed but work fine (in my experience), and they support more languages than mimic1.

TL;DR

RHVoice (documentation) seems to be worth evaluating for integration in Pure Maps, and so do eSpeakNG / libespeak-ng (documentation: [1] & [2]) and / or mimic-3 (documentation [1], [2], [3]), in order to provide maintained and improved TTS synthesis compared to the extant choices mimic1 and NanoTTS (includes PicoTTS). As these legacy components are working fine (well, mimic1 never for me, but for many others), there is currently no need to rush the evaluation and potential adoption of RHVoice and / or eSpeakNG / libespeak-ng / mimic-3.

Thank you very much for this study and summary!

I would vote using Mimic 3, it sounds the best to me, and supports a decent number of languages. And eSpeakNG is a very decent choice too, seeing how many languages it supports.
Not sure if RHVoice should be the first choice, as it doesn't support a lot of languages currently supported by Pure Maps. But that is just my opinion. Awesome writeup by Olf0!

Olf0 commented

I would vote using Mimic 3, […] And eSpeakNG is a very decent choice too, seeing how many languages it supports.

But if mimic3 is not much more than eSpeakNG (specifically libespeak-ng; that still needs to be checked), using eSpeakNG directly may be better, leaner, easier socially & technically etc.

Not sure if RHVoice should be the first choice, as it doesn't support a lot of languages currently supported by Pure Maps.

It supports Polish. 😉

More seriously: It is all about the quality of the synthesised speech in relation to the resources (CPU, RAM, I/O, mass storage) used.
You may check whether RHVoice runs under AlienDalvik (or install it natively if you have an Android device; it is available at F-Droid) and compare its quality with that of mimic1, PicoTTS (NanoTTS) etc. (I usually also have the Google TTS engine installed on AlienDalvik).
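To put rough numbers on the resource side of such a comparison, a small harness along the following lines might help. It is only a sketch under assumptions: it relies on the Unix-only `resource` module, measures just wall time, CPU time and peak memory (not I/O or storage footprint), and `measure_command` is a hypothetical helper name. The command passed in would be whatever CLI invocation of the TTS engine is under test; the demo call merely runs an empty Python process.

```python
import resource
import subprocess
import sys
import time


def measure_command(cmd):
    """Run cmd once and report wall time, CPU time and peak memory of child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "wall_s": wall,
        # user + system CPU time consumed by the child that was just waited for
        "cpu_s": (after.ru_utime - before.ru_utime)
                 + (after.ru_stime - before.ru_stime),
        # peak RSS over all waited-for children so far
        # (KiB on Linux, bytes on macOS), so run one engine per harness process
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Placeholder command; replace with the TTS engine invocation to be measured.
    print(measure_command([sys.executable, "-c", "pass"]))
```

Running each engine on the same sentence and comparing the resulting figures next to the listening impression gives at least a first-order view of quality versus cost.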

One more, suggested by piggz: https://github.com/coqui-ai/TTS. It looks to be distributed via pip, at least on PC.