compulim/web-speech-cognitive-services

Integration with React Speech Recognition

JamesBrill opened this issue · 3 comments

Hi @compulim ! I'm the author of the React Speech Recognition hook. I recently made a new release that supports polyfills such as yours. Indeed, yours is currently the first and only polyfill that works (more or less) with react-speech-recognition. You've done a great job - for the most part, everything worked smoothly while I was testing the two libraries together.

Some feedback on some wrinkles I encountered while testing the integration between the two libraries:

  • I think this is an issue in the underlying Cognitive Services SDK, but I found that providing a subscription key rather than an authorization token resulted in authorization errors from Azure. I also found this to be the case in your playground. With that in mind, it would be cool if your polyfill handled the conversion of subscription keys to authorization tokens when consumers provide them. The token endpoint seems pretty stable, so you could make a good guess at it (https://${REGION}.api.cognitive.microsoft.com/sts/v1.0/issuetoken) if the consumer didn't provide it themselves. You could also handle the caching of the authorization tokens (see the sketch after this list).
  • I think this was raised in another GitHub issue, but the SpeechRecognition events your polyfill emits don't set resultIndex. react-speech-recognition makes use of this while managing the transcript. It could be set to results.length - 1, which is the workaround I applied as a consumer.
  • There seems to be a race condition where sometimes, when calling stop, a "final" result that was emitted before the stop gets emitted a second time. I say race condition because I wasn't able to reproduce this consistently. It doesn't happen with the Chrome browser Speech Recognition engine. I was able to find a workaround, but it would be nice to get this fixed.
  • Perhaps related to the previous point, but I found that when stopping and immediately restarting the polyfill on Firefox or Safari, it would become unresponsive. I do this when changing languages. Hard to tell what's going on, but again I assume a race condition somewhere.
  • Azure returns 400 responses if no language is explicitly set by the polyfill consumer - it looks like the polyfill uses the language from the DOM by default, which is not always a valid Azure language code.
  • I'm not sure if this is a solvable problem, but I found the need to wrap the polyfill setup in an async function a bit cumbersome. I'm not totally convinced it's necessary - under the hood, it looks like most of your async logic actually happens when the polyfill consumer asks it to start listening rather than when the polyfill is instantiated. The polyfill consumer still has to perform some async logic to get the authorization token - however, as mentioned above, the polyfill could do this work for the consumer, with that logic potentially being run once on the first call to start or in the background on creation.
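For illustration, here's a minimal sketch of that conversion-plus-caching idea, assuming the stable issuetoken endpoint mentioned above. REGION and SUBSCRIPTION_KEY are placeholders the consumer would supply, and the 10-minute token lifetime is Azure's documented default:

const TOKEN_ENDPOINT = `https://${REGION}.api.cognitive.microsoft.com/sts/v1.0/issuetoken`;

let cachedToken;
let cachedTokenExpiry = 0;

// Hypothetical helper: exchange the subscription key for an authorization
// token and cache it, refreshing a bit before the 10-minute expiry.
async function fetchAuthorizationToken() {
  if (cachedToken && Date.now() < cachedTokenExpiry) {
    return cachedToken;
  }

  const response = await fetch(TOKEN_ENDPOINT, {
    method: 'POST',
    headers: { 'Ocp-Apim-Subscription-Key': SUBSCRIPTION_KEY }
  });

  cachedToken = await response.text();
  cachedTokenExpiry = Date.now() + 9 * 60 * 1000; // refresh after ~9 minutes

  return cachedToken;
}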

Thanks for making this polyfill and I hope some of the above is useful. If you want to donate more of your speech recognition polyfill-making skills, there is a similar WIP project for AWS Transcribe that I'd love to be able to integrate with. There's also a general discussion about web speech recognition polyfills here.

Thanks @JamesBrill. Love to see people investing in W3C APIs rather than building their own.

  • We can't polyfill "subscription key -> authorization token". In Custom Voices, a subscription key is required for getVoicesList(); it won't work with an authorization token.
    • In my production use cases, we almost always use authorization tokens and haven't hit any authorization issues yet
    • The /issuetoken URL would change when using Speech Services in a sovereign cloud, such as .azure.us
  • This is a good call, I will file a bug
  • Do you mean a result event with isFinal === true was emitted twice? It would be great if you could give more information on how to increase the chance of reproducing it, e.g. with shorter/longer phrases, etc.
  • Will try to repro on Firefox and Safari while repeatedly aborting a speech recognition
    • In the code, the abort function calls the Speech SDK's stopContinuousRecognitionAsync; I am worried the bug may be in their code
  • If lang is not set, other than guessing the value from navigator.language, what do you think a better default value would be?
  • Are you talking about fetchCredentials having to be async? If not, can you point out which setup functions could be made synchronous?

Love to see many people converting cloud-based services into the W3C Web Speech API. However, my real-world job doesn't leave me time for another hobby project. I will try to join in if I have spare time, but do let me know when your ponyfill is ready. I will try it out in my real-world project. 😄

Here are my tips for writing a polyfill for similar systems:

  • [P0] Look at createSpeechRecognitionPonyfill: the core code is sequential and works like a message bus (see the first sketch after this list).
    • I tried many designs before, and this one is the least prone to bugs and race conditions. It is easy to diagnose issues, and it also decouples the event behaviors from the underlying SDK and network calls. I highly recommend this approach.
    • In some cases, when events are not emitted by the Speech SDK, e.g. the soundstart/soundend events, I need to create them synthetically (search for FirstAudibleChunk in my code)
  • [P0] Build a lot of test cases; some bugs are very difficult to repro
  • [P1] Write a playground like mine. My color-coded events helped me a lot in debugging the order of events
  • [P1] Allow developers to pass their own AudioContext instance. An AudioContext is suspended on creation until a user gesture resumes it (a "blessed" AudioContext instance), and this blessing is outside the scope of your ponyfill (see the second sketch after this list)
  • [P2] I love the fetchCredentials design, very flexible on the caller side, but I might call it credentials instead
  • [P3] It could be fun to allow developers to pass a Web Audio graph node, so they could use your SpeechRecognition over WebRTC
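To make the first [P0] point concrete, here is a rough sketch of that shape; the names are illustrative, not the actual createSpeechRecognitionPonyfill internals:

// All inputs - SDK callbacks, network results, synthetic events such as
// FirstAudibleChunk - push onto a single queue...
const queue = [];
let notify = null;

function push(event) {
  queue.push(event);

  if (notify) {
    notify();
    notify = null;
  }
}

async function nextEvent() {
  while (!queue.length) {
    await new Promise(resolve => { notify = resolve; });
  }

  return queue.shift();
}

// ...and one sequential loop consumes them, so the order of emitted
// SpeechRecognition events (start, audiostart, result, end, ...) is decided
// in exactly one place.
async function run(emit) {
  for (;;) {
    const event = await nextEvent();

    if (event.type === 'stop') { break; }

    emit(event); // map queue items to DOM-style events here
  }

  emit({ type: 'end' });
}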
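And for the [P1] AudioContext point, the "blessing" has to happen in the host app inside a user gesture handler; a minimal sketch, where the button is just an example of such a gesture:

// Browsers create an AudioContext in the "suspended" state; it can only be
// resumed from a user gesture, which is outside the ponyfill's control.
const audioContext = new AudioContext();

document.querySelector('#start-button').addEventListener('click', async () => {
  if (audioContext.state === 'suspended') {
    await audioContext.resume();
  }

  // audioContext is now "blessed" and can be passed to the ponyfill factory.
});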

I would recommend this signature for ponyfilling more stuff:

// Suggested signature: every browser global is injectable and defaults to window.
function createAWSTranscribeSpeechRecognition({
  audioContext,
  credentials,
  ponyfill: {
    AudioContext, // in case the user did not pass an "audioContext" instance, you will create a new one using this class
    fetch,
    WebSocket
  } = window
}) {
  // ...
}

In this way, you enable Node.js developers to use your package as long as they provide the needed ponyfills (without polluting globals). It will also be easier for you to write tests, as you can easily mock the external system.
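For example, a Node.js caller might wire it up like this; this is a sketch against the suggested signature above, and the ws and node-fetch packages plus MockAudioContext are assumptions, not requirements:

import fetch from 'node-fetch';
import WebSocket from 'ws';

import { MockAudioContext } from './test-doubles'; // hypothetical test double

const { SpeechRecognition } = createAWSTranscribeSpeechRecognition({
  credentials, // whatever the service needs, defined elsewhere
  ponyfill: {
    AudioContext: MockAudioContext,
    fetch,
    WebSocket
  }
});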

In my production system, we use TTS to test against STT, i.e. we use TTS to generate a waveform from textual test data, then feed the waveform into STT for assertion, and vice versa. We mocked AudioContext in a limited fashion and cross-check to make sure both STT and TTS work correctly.
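As a sketch of that round trip, with synthesize() and recognize() standing in as hypothetical helpers wrapping the mocked TTS and STT ponyfills:

test('STT transcribes what TTS synthesized', async () => {
  const text = 'hello world';

  const waveform = await synthesize(text);      // TTS: text -> waveform
  const transcript = await recognize(waveform); // STT: waveform -> text

  expect(transcript.toLowerCase()).toBe(text);
});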

Hi @compulim sorry for the massive delay in replying - your message got lost amongst my other GitHub notifications. Unfortunately, this means I've forgotten a lot of the context from my original message, but I'll do my best to address your questions. It looks like most of my original issues no longer apply.

We can't polyfill "subscription key -> authorization token".

Curiously, I've been able to authenticate by just passing a subscription key to credentials. This definitely wasn't the case earlier this year, when I was forced to convert it to an authorization token like this:

import { createSpeechServicesPonyfill } from 'web-speech-cognitive-services';

// Exchange the subscription key for a short-lived authorization token
// (TOKEN_ENDPOINT is the regional issuetoken URL mentioned above)...
const response = await fetch(TOKEN_ENDPOINT, {
  method: 'POST',
  headers: { 'Ocp-Apim-Subscription-Key': SUBSCRIPTION_KEY }
});
const authorizationToken = await response.text();

// ...then set up the ponyfill with the token instead of the key.
const {
  SpeechRecognition: AzureSpeechRecognition
} = createSpeechServicesPonyfill({
  credentials: {
    region: REGION,
    authorizationToken,
  }
});

So my pain point around doing the subscription key -> auth token conversion is no longer valid - maybe my Speech Service instance was misconfigured back then. It makes sense for consumers to take on the burden of performing this conversion in production - it should happen on their backend to avoid leaking the subscription key (see the sketch below). I shall update my docs for this in react-speech-recognition, as I'm currently suggesting consumers perform this conversion inside the component (i.e. in the browser), which is not the appropriate place to do it.
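For example, the conversion could live behind a small backend route; a sketch using Express, where the route path and environment variable names are arbitrary:

// The subscription key stays on the server; the browser only ever sees
// short-lived authorization tokens.
app.get('/api/speech-token', async (req, res) => {
  const response = await fetch(
    `https://${process.env.AZURE_REGION}.api.cognitive.microsoft.com/sts/v1.0/issuetoken`,
    {
      method: 'POST',
      headers: { 'Ocp-Apim-Subscription-Key': process.env.AZURE_SUBSCRIPTION_KEY }
    }
  );

  res.send(await response.text());
});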

Do you mean result event with isFinal === true was emitted twice? Will be great if you can give more information on how to increase the chance to repro it, e.g. with shorter/longer phrases, etc.

I'm afraid I'm not able to repro this any more. There is another bug I've noticed in stop, which I'll raise an issue for shortly.

If lang is not set, other than guessing the value from navigator.language, what do you think a better default value should be?

navigator.language seems to get the language in the locale format that Azure requires. The solution may be as simple as preferring this over the lang attribute when computing the default language here (currently, the lang attribute is preferred).
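In other words, something along these lines when computing the default; a sketch of the suggested precedence, not the polyfill's actual code:

// navigator.language is usually a full locale ("en-US"), while the document's
// lang attribute can be a bare language tag ("en") that Azure rejects.
const lang =
  explicitLang ||                                  // hypothetical: lang set by the consumer
  navigator.language ||                            // preferred default
  document.documentElement.getAttribute('lang') || // last resort
  'en-US';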

I checked that createSpeechRecognitionPonyfill is a sync function

I think my point here is also no longer valid - I can see this is indeed the case (perhaps I was confused by the Promise-like then property it returns) and that this polyfill can be set up synchronously.

Here are my tips for writing a polyfill for similar systems

Thanks for these! I've not had to implement one of these polyfills myself yet, but will share this with the person who's doing the AWS polyfill.

In my production system, we use TTS to test against STT. I.e. we use TTS to generate waveform from textual test data, then feed the waveform into STT for assertion.

This is really cool - I thought of making some end-to-end tests like this using pre-recorded audio files, but using TTS is a good way of generating deterministic audio inputs.

@compulim @JamesBrill I'm hoping someone can help with error handling. If the authorization token is invalid, I'd like to catch the error and refresh the token, but the error doesn't enter the catch block.

I checked the source code and found that createSpeechServicesPonyfill performs an async operation, using fetch for the network call.

I tried to wrap it in a Promise then/catch, but that path is deprecated:

console.warn('web-speech-cognitive-services: This function no longer need to be called in an asynchronous fashion. Please update your code. We will remove this Promise.then function on or after 2020-08-10.');

try {
  const { SpeechRecognition: AzureSpeechRecognition } =
    createSpeechServicesPonyfill({
      credentials: {
        region: azureRegion,
        authorizationToken: azureToken,
      },
    });
  SpeechRecognition.applyPolyfill(AzureSpeechRecognition);
} catch (e) {
  console.log("Error Azure", e);
}

I also can't do this:

await createSpeechServicesPonyfill({
  credentials: {
    region: azureRegion,
    authorizationToken: azureToken,
  },
})
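For what it's worth, one possible direction: since the fetch happens asynchronously after start() rather than inside the factory call, failures can only surface through the recognition instance itself. The standard Web Speech API reports runtime failures via the error event; whether this polyfill routes auth failures through that event, and under which error code, is an assumption to verify:

const recognition = new AzureSpeechRecognition();

recognition.addEventListener('error', async event => {
  // SpeechRecognitionErrorEvent codes are standardized ('network',
  // 'not-allowed', ...); which one an auth failure maps to is an assumption.
  if (event.error === 'network' || event.error === 'not-allowed') {
    azureToken = await refreshAzureToken(); // hypothetical refresh helper
    // ...recreate the ponyfill with the fresh token and retry...
  }
});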