jovotech/jovo-framework

Provide a Platform-independent way to update speech and reprompt values in $response

Closed this issue · 11 comments

I'm submitting a...

  • Bug report
  • Feature request
  • Documentation issue or request
  • Other... Please describe:

Expected Behavior

Be able to call getSpeech/setSpeech and getReprompt/setReprompt in a way that is consistent across platforms. These methods would access the underlying values of CoreResponse, AlexaResponse, etc.
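
For example, a TTS plugin could then do something like this on any platform (a sketch of the desired API, not existing code; item stands for any platform's response item):

    // desired, platform-independent usage (sketch)
    const speech = item.getSpeech();            // same call for CoreResponse, AlexaResponse, ...
    item.setSpeech(`<speak>${speech}</speak>`);
    const reprompt = item.getReprompt();
    item.setReprompt(`<speak>${reprompt}</speak>`);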

Current Behavior

Currently, you need to write platform-specific code for each response:

    const platformResponse: NormalizedOutputTemplate[] =
      jovo.$platform.outputTemplateConverterStrategy.fromResponse(response);

    for (const template of platformResponse) {
      if (template instanceof NormalizedOutputTemplate) {
        // access/update message
        // access/update reprompt
      }
    }

Error Log

N/A

Your Environment

@jovotech/cli: 4.1.6

Jovo packages of the current project:

  • @jovotech/cli-command-build: 4.1.6
  • @jovotech/cli-command-deploy: 4.1.6
  • @jovotech/cli-command-get: 4.1.6
  • @jovotech/cli-command-new: 4.1.6
  • @jovotech/cli-command-run: 4.1.7
  • @jovotech/cli-core: 4.1.7
  • @jovotech/common: 4.2.10
  • @jovotech/db-filedb: 4.2.16
  • @jovotech/filebuilder: 0.0.1
  • @jovotech/framework: 4.2.16
  • @jovotech/model: 4.0.0
  • @jovotech/model-nlpjs: 4.0.0
  • @jovotech/nlu-nlpjs: 4.2.16
  • @jovotech/output: 4.2.12
  • @jovotech/platform-core: 4.2.16
  • @jovotech/platform-web: 4.2.16
  • @jovotech/plugin-debugger: 4.2.17
  • @jovotech/server-express: 4.2.16

Environment:
System:
OS: Windows 10 10.0.22000
Binaries:
Node: 14.19.0 - C:\Program Files\nodejs\node.EXE
npm: 8.10.0 - C:\Program Files\nodejs\npm.CMD

Would getSpeech (and getReprompt) be responsible for pulling the value from message.speech || message.text || message?

Would setSpeech(value: string) (and setReprompt) be responsible for changing a message property that was a string into one that is an object with speech and text properties?

        // if message is a plain string, convert it into the { speech, text } object form
        if (typeof template.message === 'string') {
          template.message = {
            speech: template.message,
            text: this.ssmlProcessor.isPlainText(template.message)
              ? template.message
              : this.ssmlProcessor.removeSSML(template.message),
          };
        }

        // make sure speech is populated before it gets overwritten
        if (!template.message?.speech && template.message?.text) {
          template.message.speech = template.message.text;
        }

        if (template.message?.speech) {
          template.message.speech = value;
        }
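
A matching getSpeech could then implement the fallback from the first question (a sketch using the same template variable):

        // speech first, then text, then the plain-string message itself
        if (typeof template.message === 'string') {
          return template.message;
        }
        return template.message?.speech || template.message?.text || '';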

From our Slack thread, this is what @jankoenig said:

TTS is a bit difficult because all platforms have different structures. The TTS Plugin shouldn't be responsible for knowing the platform response structures. We're thinking about adding an abstract method to the Platform class, e.g. getSpeech and setSpeech (similar with reprompt) that can be implemented by each platform and then used by e.g. TTS Plugins to transform the speech.
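
Sketched out, that suggestion would add something like the following to the Platform class (the exact signatures are my assumption):

// hypothetical abstract methods on Platform, per the suggestion above
abstract getSpeech(response: RESPONSE): string | undefined;
abstract setSpeech(response: RESPONSE, speech: string): void;
abstract getReprompt(response: RESPONSE): string | undefined;
abstract setReprompt(response: RESPONSE, reprompt: string): void;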

The challenge with adding getSpeech/setSpeech on Platform or JovoResponse is that they deal with the entire response, not with the multiple possible children of output (CoreResponse) or response (AlexaResponse).

It seems like getSpeech/setSpeech should be on NormalizedOutputTemplate (CoreResponse) or Response (AlexaResponse), but there is no common base class.

But then you still have the issue of how to iterate through the multiple children in a common way.

What if there were an interface that all single response items would implement?

export interface ResponseItem {
    getSpeech(): string;
    setSpeech(value: string): void;
    getReprompt(): string;
    setReprompt(value: string): void;
}
// Alexa: Response
export class Response implements ResponseItem {
  ...

  getSpeech(): string {
    let speech = '';

    // assumes a toMessage() helper that converts outputSpeech into a MessageValue
    const message = this.outputSpeech?.toMessage?.();

    if (typeof message === 'string') {
      speech = message;
    }

    if (message instanceof SpeechMessage) {
      speech = message.speech;
    }

    if (message instanceof TextMessage) {
      speech = message.text;
    }

    return speech;
  }

  setSpeech(value: string) {
    // type/ssml/text live on outputSpeech, not on the Response itself
    this.outputSpeech = {
      type: OutputSpeechType.Ssml,
      ssml: value,
    };
  }
}
// Core: NormalizedOutputTemplate
export class NormalizedOutputTemplate implements ResponseItem {
  ...

  getSpeech(): string {
    return typeof this.message === 'string'
      ? this.message
      : this.message?.speech || this.message?.text || '';
  }

  setSpeech(value: string) {
    // convert a plain-string message into the { speech, text } object form first
    if (typeof this.message === 'string') {
      this.message = {
        speech: this.message,
        text: this.message,
      };
    }

    if (this.message) {
      this.message.speech = value;
      // keep text as a plain-text variant without SSML markup
      if (this.message.text) {
        this.message.text = removeSSML(this.message.text);
      }
    } else {
      // no message yet, so create one
      this.message = { speech: value };
    }
  }
}

Then, to iterate through each response item (and call getSpeech/setSpeech), we could add a method on each platform-specific class that returns an array of ResponseItem objects:

// Platform

export abstract class Platform<
  REQUEST extends JovoRequest = JovoRequest,
  RESPONSE extends JovoResponse = JovoResponse,
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  JOVO extends Jovo<REQUEST, RESPONSE, JOVO, USER, DEVICE, PLATFORM> = any,
  USER extends JovoUser<JOVO> = JovoUser<JOVO>,
  DEVICE extends JovoDevice<JOVO> = JovoDevice<JOVO>,
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  PLATFORM extends Platform<REQUEST, RESPONSE, JOVO, USER, DEVICE, PLATFORM, CONFIG> = any,
  CONFIG extends PlatformConfig = PlatformConfig,
> extends Extensible<CONFIG, PlatformMiddlewares> {
  ...

  abstract getResponseItems(response: RESPONSE): ResponseItem[];
}
// Alexa: AlexaPlatform

export class AlexaPlatform extends Platform<
  AlexaRequest,
  AlexaResponse,
  Alexa,
  AlexaUser,
  AlexaDevice,
  AlexaPlatform,
  AlexaConfig
> {
  ...

  getResponseItems(response: AlexaResponse): ResponseItem[] {
    return [(response.response as unknown) as ResponseItem];
  }
}
// Core: CorePlatform

export class CorePlatform<PLATFORM extends string = 'core' | string> extends Platform<
  CoreRequest,
  CoreResponse,
  Core,
  CoreUser,
  CoreDevice,
  CorePlatform<PLATFORM>,
  CorePlatformConfig
> {
  ...

  getResponseItems(response: CoreResponse): ResponseItem[] {
    const templates = this.outputTemplateConverterStrategy.fromResponse(response);
    return Object.values(templates).map((template) => template as unknown as ResponseItem);
  }

}

Finally, the TtsPlugin base class could iterate through the response items without knowing anything about the platform-specific implementation:

// TtsPlugin (base class implemented by all TTS plugins)

export abstract class TtsPlugin<
  CONFIG extends TtsPluginConfig = TtsPluginConfig,
> extends Plugin<CONFIG> {
  ...

  protected async tts(jovo: Jovo): Promise<void> {
    const response = jovo.$response;

    // if this plugin is not able to process tts, skip
    if (!this.processTts || !response) {
      return;
    }

    const responseItems = jovo.$platform.getResponseItems(response);

    for (const item of responseItems) {
      const speech = item.getSpeech();
      if (speech) {
        // call specific TTS provider
        const result = await this.processText(jovo, speech);
        if (result && result.url) {
          item.setSpeech(buildSpeakTag(buildAudioTag(result.url)));
        }
      }
    }
  }
}
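
A concrete provider plugin would then only need to implement the provider call. A minimal sketch, where the plugin name and the synthesizeToUrl call are invented for illustration and the processTts/processText shapes follow the base-class sketch above:

// hypothetical concrete plugin; MyCloudTtsPlugin and synthesizeToUrl do not exist in the framework
export class MyCloudTtsPlugin extends TtsPlugin {
  // signals to the base class that this plugin can process TTS
  readonly processTts = true;

  protected async processText(jovo: Jovo, text: string): Promise<{ url?: string }> {
    // call the (hypothetical) provider API and return the hosted audio file URL
    const url = await synthesizeToUrl(text);
    return { url };
  }
}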

@jankoenig @aswetlow Please see the thread above, where I think through how this might be implemented.
You know the platform better and will have your own ideas.

I would like us to settle on an approach ASAP so the changes can get into the framework, the base TtsPlugin can be implemented, and we can start building TTS plugins.

Thank you!

Here is a branch where I tried to figure out where each of the types should go:
https://github.com/rmtuckerphx/jovo-framework/tree/v4/feature/platform-tts-methods
There are still errors, though, so I need your expertise on this.

Also, I think each platform should surface which SSML tags it supports:

  • Web: audio, break
  • Alexa: https://developer.amazon.com/en-US/docs/alexa/custom-skills/speech-synthesis-markup-language-ssml-reference.html

Also, each TTS plugin should declare which SSML tags it supports:

  • Polly: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html
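
One way to express both sides (a sketch; the interface and property names are assumptions):

// hypothetical: platforms and TTS plugins both declare their supported SSML tags
export interface SsmlSupport {
  readonly supportedSsmlTags: string[];
}

// e.g. WebPlatform could declare:
const webSsmlSupport: SsmlSupport = {
  supportedSsmlTags: ['audio', 'break'],
};

// and a Polly TTS plugin (subset, for illustration):
const pollySsmlSupport: SsmlSupport = {
  supportedSsmlTags: ['break', 'prosody', 'say-as', 'sub'],
};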

Then, after calling getSpeech, there should be a way to split the string into the parts whose SSML tags the Platform supports and those it does not. The parts the Platform doesn't support would be passed to the TTS plugin (e.g. Polly); any tags the TTS service doesn't support would be removed, and the resulting audio URL returned.

Note: a single string returned by getSpeech could result in multiple calls to TTS.
ex: "<audio src='https://example.com/audio1.mp3'/>Some text that could include SSML.<audio src='https://example.com/audio2.mp3'/> Some other SSML text."
This would be 2 calls to the TTS plugin (or maybe a single call with an array of text/ssml parts to process).

We need a way to put the string back together before calling setSpeech:
"<audio src='https://example.com/audio1.mp3'/><audio src='https://example.com/tts1.mp3'/><audio src='https://example.com/audio2.mp3'/><audio src='https://example.com/tts2.mp3'/>"

Each TTS plugin should have access to a common set of SSML-related utility functions (a few are sketched after this list):

  • isPlainText(ssml: string): boolean
  • isSsml(ssml: string): boolean
  • isSupportedTag(supportedTags: string[], ssml: string): boolean
  • buildAudioTag(src: string): string
  • buildSpeakTag(ssml: string): string
  • getAudioSource(ssml: string): string
  • splitSpeechBySupportedTags(platformSupportedTags: string[], ttsSupportedTags: string[], ssml: string): any[]
  • removeSSML(ssml: string, keepTags?: string[]): string
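
A minimal sketch of a few of them (regex-based; a production implementation would need a real SSML parser, especially for nested tags):

export function isSsml(ssml: string): boolean {
  return /<[^>]+>/.test(ssml);
}

export function isPlainText(ssml: string): boolean {
  return !isSsml(ssml);
}

export function buildAudioTag(src: string): string {
  return `<audio src='${src}'/>`;
}

export function buildSpeakTag(ssml: string): string {
  return `<speak>${ssml}</speak>`;
}

export function removeSSML(ssml: string, keepTags: string[] = []): string {
  // strip every tag except the ones explicitly kept; inner text is preserved
  return ssml.replace(/<\/?([a-zA-Z:-]+)[^>]*>/g, (match, tag) =>
    keepTags.includes(tag) ? match : '',
  );
}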

Also, something to consider when removing unsupported tags: we may still need the tag's content even though the tag itself is removed.

For example, the say-as tag:

<speak>
     I was born on <say-as interpret-as="date" format="mdy">12-31-1900</say-as>.
</speak>

If neither the platform nor the TTS plugin supports say-as, then when we call removeSSML to remove unsupported tags, the date needs to be preserved in the string:
I was born on 12-31-1900.

Maybe the TTS plugin (e.g. Polly) or the TtsPlugin base class could handle the processing of some SSML tags (such as say-as) in code even if the TTS API doesn't support them, like an SSML pre-processor.

The sub tag is another good candidate that can be handled in code:

<sub alias="new word">abbreviation</sub>
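
A sketch of such a pre-processor covering these two tags (regex-based and simplified; it assumes tags are not nested and is only meant to illustrate the idea):

// resolve some SSML tags in code before sending the string to the TTS provider
export function preprocessSsml(ssml: string): string {
  // <sub alias="new word">abbreviation</sub>  ->  "new word"
  ssml = ssml.replace(/<sub[^>]*alias=["']([^"']*)["'][^>]*>[\s\S]*?<\/sub>/g, '$1');
  // <say-as interpret-as="date" ...>12-31-1900</say-as>  ->  "12-31-1900"
  ssml = ssml.replace(/<say-as[^>]*>([\s\S]*?)<\/say-as>/g, '$1');
  return ssml;
}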

Would getSpeech (and getReprompt) be responsible for pulling the value from message.speech || message.text || message?

Would setSpeech(value: string) (and setReprompt) be responsible for changing a message property that was a string into one that is an object with speech and text properties?

getSpeech and setSpeech would be Response methods; they wouldn't read from an OutputTemplate but rather from the $response.

The challenge with adding getSpeech/setSpeech on Platform or JovoResponse is that they deal with the entire response, not with the multiple possible children of output (CoreResponse) or response (AlexaResponse).

I'm not sure about this one. If I understand it correctly, a TTS plugin wouldn't want to make multiple API calls for multiple output children. Rather, I'd want to take the final speech of the response JSON and call a TTS API once for it.


EDIT: I see what you mean now. CorePlatform and WebPlatform use the output template structure for the output part of the response. I'll think a bit more about this.

It seems like getSpeech/setSpeech should be on NormalizedOutputTemplate (CoreResponse) or Response (AlexaResponse), but there is no common base class.

Here is the base JovoResponse class: https://github.com/jovotech/jovo-framework/blob/v4/latest/output/src/models/JovoResponse.ts

And here's AlexaResponse, for example: https://github.com/jovotech/jovo-framework/blob/v4/latest/platforms/platform-alexa/src/AlexaResponse.ts

I'll have to talk this through with @aswetlow tomorrow, but I'd suggest adding abstract methods to JovoResponse and then having all platforms implement them.

The question is what we should do if a platform doesn't support speech.