MycroftAI/mycroft-core

Audio pre-transcription parsing

NeonDaniel opened this issue · 2 comments

Is your feature request related to a problem? Please describe.
It can be useful to modify audio before it is passed to STT plugins, e.g. to remove silence and normalize audio levels for better transcription accuracy. There are also use cases for tagging audio with information that skills could use (speaker identification, mood detection, etc.).

Describe the solution you'd like
This is already implemented in Neon, and the plugin base class is defined in neon-transformers.
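
For illustration, the rough shape of such a plugin might look like the sketch below. This is a simplified example, not the actual neon-transformers interface; the class and method names are made up here.

```python
# Hypothetical sketch of an audio transformer plugin interface.
# AudioTransformer, transform(), and priority are illustrative names,
# not the real neon-transformers API.
from abc import ABC, abstractmethod


class AudioTransformer(ABC):
    """Base class for plugins that modify or tag raw audio before STT."""

    def __init__(self, name: str, priority: int = 50):
        self.name = name
        self.priority = priority  # lower values run earlier in the chain

    @abstractmethod
    def transform(self, audio_data: bytes) -> tuple:
        """Return (possibly modified audio bytes, dict of context tags).

        Tags (speaker id, detected mood, ...) would be attached to the
        message context so skills can act on them later.
        """


class TrimSilence(AudioTransformer):
    """Toy example: a transformer that would strip leading/trailing silence."""

    def transform(self, audio_data: bytes) -> tuple:
        trimmed = audio_data  # a real plugin would detect and cut silence here
        return trimmed, {"silence_trimmed": True}
```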

Describe alternatives you've considered
N/A

Additional context
This was discussed on the forum: https://community.mycroft.ai/t/proposal-for-organizing-functionality-in-mycroft-core/11519/6

Hey, I've definitely talked to different people about similar ideas, and I like how the concept of an "audio transformer" abstracts the functionality away from where it sits in the pipeline.

I think one of the things we want to explore is how to enable projects to use elements like this in ways that solve their particular needs, without necessarily needing to modify core itself. This could be used pre-STT, post-TTS, or for any other purpose. It does a specific task rather than being baked into one of these services. As an example, if you had a noise reduction audio transformer:

  • Project A wants to use it pre-transcription to improve recognition.
  • Project B wants to use it to clean up their TTS output.
  • Project C is pulling in audio clips from a 3rd-party source and wants to clean them up before playing them back to the user.
  • and Project D wants to do it all!

The ideal architecture would allow them all without needing to fork core, or the STT/TTS/other service they have selected.
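
As a rough sketch of that idea (all of the names below are hypothetical, just to show the shape), the same transformer could be reused at each of those points without the chosen STT/TTS service knowing about it:

```python
# Hypothetical wiring: one transformer instance, reused at different stages.
# NoiseReduction and the hook functions are illustrative only.

class NoiseReduction:
    """Stand-in for a noise reduction audio transformer."""

    def transform(self, audio_bytes: bytes) -> bytes:
        # A real implementation would apply spectral gating or similar here.
        return audio_bytes


noise_reduction = NoiseReduction()


def before_stt(mic_audio: bytes) -> bytes:
    # Project A: clean up the microphone capture before transcription.
    return noise_reduction.transform(mic_audio)


def after_tts(synthesized_audio: bytes) -> bytes:
    # Project B: clean up synthesized speech before playback.
    return noise_reduction.transform(synthesized_audio)


def before_playback(clip_audio: bytes) -> bytes:
    # Project C: clean up third-party clips before playing them to the user.
    return noise_reduction.transform(clip_audio)
```

Project D would simply register the same transformer at all three points.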

This could be used pre-STT, post-TTS, or for any other purpose

I hadn't thought of the post-TTS use case, but that would be very useful for cleaning up poor quality outputs (I'm thinking of the old MozillaTTS that would append sounds to text without punctuation), or to handle a user wanting their responses read back faster/slower. If the audio backend doesn't have to deal with those transformations, it also means they should work with any backend.