MycroftAI/mycroft-core

RemoteTTS "sentence" splitting issues

emphasize opened this issue · 9 comments

def execute(self, sentence, ident=None, listen=False):
phrases = self.__get_phrases(sentence)

@staticmethod
def __get_phrases(sentence):
phrases = re.split(r'\.+[\s+|\n]', sentence)
phrases = [p.replace('\n', '').strip() for p in phrases]
phrases = [p for p in phrases if len(p) > 0]
return phrases

The splitter in its current form is suboptimal, since phrases that aren't normalized by Lingua-franca format functions
(e.g. "Die aktuelle Uhrzeit ist 22:20:38 Uhr. Mitteleuropäische Sommerzeit, Dienstag, 27. September 2022.")
are split into ["Die aktuelle Uhrzeit ist 22:20:38 Uhr", "Mitteleuropäische Sommerzeit, Dienstag, 27", "September 2022"], because the ordinal "27." is mistaken for a sentence boundary.
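
For reference, the split is easy to reproduce (note that the actual last chunk keeps its trailing period, since no whitespace follows it):

import re

sentence = ("Die aktuelle Uhrzeit ist 22:20:38 Uhr. "
            "Mitteleuropäische Sommerzeit, Dienstag, 27. September 2022.")

# [\s+|\n] is a character class, so '+' and '|' are literal characters
# here; the pattern simply splits on one or more dots followed by a
# single whitespace character, which also hits the ordinal "27."
phrases = re.split(r'\.+[\s+|\n]', sentence)
phrases = [p.replace('\n', '').strip() for p in phrases]
print([p for p in phrases if len(p) > 0])
# ['Die aktuelle Uhrzeit ist 22:20:38 Uhr',
#  'Mitteleuropäische Sommerzeit, Dienstag, 27',
#  'September 2022.']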

Please make it/them accessible to classes subclassing RemoteTTS (if it has to stay in the current form).

For instance:

class Sub(RemoteTTS):

    @staticmethod
    def _get_phrases(sentence):
        return [sentence]
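
As an aside, the rename from __get_phrases to _get_phrases is what makes overriding possible at all: the double leading underscore triggers Python name mangling, so the base class never sees a subclass override. A minimal illustration:

class Base:
    @staticmethod
    def __get(s):          # mangled to _Base__get
        return s.split('.')

    def run(self, s):
        return self.__get(s)   # always resolves to _Base__get

class Sub(Base):
    @staticmethod
    def __get(s):          # mangled to _Sub__get, never called by run()
        return [s]

assert Sub().run("a. b") == ['a', ' b']  # Base's splitter still wins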

I deprecated this class in a backwards-compatible way in OPM; I removed the split and am using requests directly.

https://github.com/OpenVoiceOS/OVOS-plugin-manager/blob/dev/ovos_plugin_manager/templates/tts.py#L984

Yeah I can't see any reason that a subclass shouldn't be able to override it. Are you able to do a little PR for that?

But Jarbas raises a good question: should we allow each plugin to split sentences in its own way, or perform it consistently at a higher level?

I think we should consider that independently of this particular (rather rudimentary) implementation of sentence splitting. Though something tells me that "sentences" are not universal enough to handle it in a single way.

The first thing that comes to mind is Spanish questions ("¿Cómo te llamas?"), but I'm sure there are other examples that aren't as easy to account for.

There are some additional problems with RemoteTTS, since it essentially bypasses caching and the playback queue (consequently making pulse_duck: True ineffective).

In RemoteTTS, adding/rewriting

from .cache import hash_sentence

def execute(self, sentence, ident=None, listen=False):
    # for part in self._get_phrases(sentence):  ## gets tossed
    # (assumes the request helper is reachable from subclasses,
    # e.g. __request renamed to _request)
    response = self._request(sentence).result()
    if response.status_code == 200:
        input_file = self.__cache(sentence, response.content)
        viseme = None
        self.queue.put((self.audio_ext, input_file, viseme, ident, listen))
    else:
        LOG.error('%s Http Error: %s for url: %s' %
                  (response.status_code, response.reason, response.url))

def __cache(self, sentence, data):
    if self.config.get("cache", True):  # let configs opt out
        sentence_hash = hash_sentence(sentence)
        file = str(self.cache.define_audio_file(sentence_hash).path)
        self.cache.cached_sentences[sentence_hash] = (file, None)
    else:
        file = self.filename

    with open(file, 'wb') as f:
        f.write(data)

    return file

should do the trick.

Maybe we should also make split_in_sentences a config option?
(Why is it a staticmethod at the moment, and where is it used as such?)

def _get_phrases(self, sentence):
    if self.config.get("split_in_sentences", True):
        phrases = re.split(r'\.+[\s+|\n]', sentence)
        phrases = [p.replace('\n', '').strip() for p in phrases]
        phrases = [p for p in phrases if len(p) > 0]
    else:
        phrases = [sentence]

    return phrases
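
A plugin's section in mycroft.conf could then opt out (a hypothetical fragment; "coqui" stands in for whatever tts module is configured):

"tts": {
    "module": "coqui",
    "coqui": {
        "split_in_sentences": false
    }
}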

I would do so in the coqui plugin, but this could be universal and thus require less code.

Splitting up might give a better experience above a certain RTF: e.g. at a real-time factor of 0.5, a 20-second reply takes 10 seconds to synthesize in one go, whereas splitting lets playback start as soon as the first sentence is ready.

... should we allow each plugin to split sentences in its own way or perform it consistently at a higher level?

If you're running anything other than "enclosure": "picroft" and the string is not SSML, this is already the case. I just wonder why the config key is ignored.

There is a split hardcoded in core for Picroft, a split hardcoded in RemoteTTS, and mimic2 has its own split logic as well.

I think this split will be very plugin-specific and is not one of the things that should be handled in a one-size-fits-all fashion. My recommendation is to allow each plugin to split things if and how it wants, not to enforce the split.

In OVOS I removed all this split logic; it never helped in practice and often caused weirdness. As an anecdotal example, it changed pronunciations completely in some cases, simply because the TTS performs better with the full utterance as context.

Totally agree with @JarbasAl, that was the idea behind making the split in .preprocess_utterance() so any plugin can override as needed (but still providing some sort of sane(R) default that should(TM) be usable in most cases).
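
An override along those lines might look like this (a sketch assuming preprocess_utterance takes the utterance and returns the list of chunks to synthesize, as the comment above describes):

class MyTTS(TTS):
    def preprocess_utterance(self, utterance):
        # Return the full utterance as a single chunk so the engine
        # keeps the complete context (cf. the pronunciation note above).
        return [utterance]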

The initial sentence splitting was added to increase responsiveness for mimic1 (processing one sentence at a time was faster than a whole Wikipedia article). Then audio would crash on Picroft if chunks were too long, so additional splitting was added there too (I think that has since been resolved on the platform and removed from Mycroft). After that, Mimic2 had issues with too-long text and needed extra utterance processing.

I do think sentence splitting for increased responsiveness is generally a good thing (at least for the TTS architectures I've seen, where audio is generated as a final step), but as Jarbas says, it's a tradeoff with quality...

That explains the genealogy. So the check should be (as indicated by the TODO):

if tts.config.get("split_in_sentences", True): 
    chunks = tts.preprocess_utterance(utterance)
    ...

with RemoteTTS tossing

def _get_phrases(...)

and related code, and the respective plugin overriding preprocess_utterance() if it comes up with/needs something special. Yet, I would make the default splitter a part of mycroft.util.x
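
As a sketch of what such a shared default splitter could look like (the function name and the digit lookbehind are assumptions for illustration, not existing mycroft.util API):

import re

# Split on sentence-final punctuation followed by whitespace. The
# negative lookbehind keeps digit+period ordinals such as the German
# "27. September" intact; like the current splitter, the terminal
# punctuation itself is dropped.
_SENTENCE_END = re.compile(r'(?<!\d)[.!?]+(?=\s)')

def default_sentence_split(text):
    return [p.strip() for p in _SENTENCE_END.split(text) if p.strip()]

# default_sentence_split("Die aktuelle Uhrzeit ist 22:20:38 Uhr. "
#                        "Mitteleuropäische Sommerzeit, Dienstag, "
#                        "27. September 2022.")
# -> ['Die aktuelle Uhrzeit ist 22:20:38 Uhr',
#     'Mitteleuropäische Sommerzeit, Dienstag, 27. September 2022.']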