ms-bing-speech-service

NodeJS service wrapper for Microsoft Speech API and Custom Speech Service



This library is now deprecated because Microsoft has released an official NodeJS/JavaScript SDK for the Microsoft Speech Service with websocket support. Please use that instead of this library. Thanks! 🙇🏼‍♀️


Microsoft Speech to Text Service

(Unofficial) JavaScript service wrapper for the Microsoft Speech API. It specifically implements the Speech WebSocket API, which supports recognition of long speech up to 10 minutes in length. Looking for Microsoft Speech HTTP API (short speech) support instead? This SDK can help you out :)


Installation

  1. Install NodeJS on your computer
  2. Create a new directory for your code project if you haven't already
  3. Open a terminal and run npm install ms-bing-speech-service from your project directory

Usage

✨ This library works in both browser and NodeJS runtime environments ✨ Please see the examples directory in this repo for more in-depth examples than those below.

Microsoft Speech API

You'll first need to create a Microsoft Speech API key. You can do this while logged in to the Azure Portal.

The following code will get you up and running with the essentials in Node:

const speechService = require('ms-bing-speech-service');

const options = {
  language: 'en-US',
  subscriptionKey: '<your Bing Speech API key>'
};

const recognizer = new speechService(options);

recognizer
  .start()
  .then(_ => {
    recognizer.on('recognition', (e) => {
      if (e.RecognitionStatus === 'Success') console.log(e);
    });

    recognizer.sendFile('future-of-flying.wav')
      .then(_ => console.log('file sent.'))
      .catch(console.error);
  })
  .catch(console.error);
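If you'd rather have a single promise that resolves once the whole file has been processed, you can wrap the calls above yourself. The sketch below is not part of this library: `recognizeFile` is a hypothetical helper, it assumes the 'simple' result format (where each successful message carries a `DisplayText` field), and it waits for 'turn.end' before stopping the recognizer.

```javascript
// Hypothetical helper (not part of ms-bing-speech-service): runs one recognition
// pass and resolves with the DisplayText of every successful result.
// Assumes `recognizer` exposes the start(), stop(), sendFile() and on() methods
// documented in this README, and that results use the 'simple' format.
function recognizeFile(recognizer, file) {
  return new Promise((resolve, reject) => {
    const phrases = [];

    // Register listeners before sending audio so no result is missed.
    recognizer.on('recognition', (e) => {
      if (e.RecognitionStatus === 'Success') phrases.push(e.DisplayText);
    });
    recognizer.on('error', reject);

    // 'turn.end' fires once all results for the audio have been received.
    recognizer.on('turn.end', () => {
      recognizer.stop().then(() => resolve(phrases), reject);
    });

    recognizer.start()
      .then(() => recognizer.sendFile(file))
      .catch(reject);
  });
}
```

Usage would look like `recognizeFile(new speechService(options), 'future-of-flying.wav').then((phrases) => console.log(phrases.join(' ')))`.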

You can also use this library with the async/await pattern!

const speechService = require('ms-bing-speech-service');

(async function() {

  const options = {
    language: 'en-US',
    subscriptionKey: '<your Bing Speech API key>'
  };

  const recognizer = new speechService(options);
  await recognizer.start();

  recognizer.on('recognition', (e) => {
    if (e.RecognitionStatus === 'Success') console.log(e);
  });

  recognizer.on('turn.end', async (e) => {
    console.log('recognizer is finished.');

    await recognizer.stop();
    console.log('recognizer is stopped.');
  });

  await recognizer.sendFile('future-of-flying.wav');
  console.log('file sent.');

})().catch(console.error); // surface errors from start() or sendFile()
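If 'turn.end' never arrives (for example, because the connection drops), nothing in the example above ever stops the recognizer. One way to guard against that is to race the event against a timer. `waitForEvent` below is a hypothetical helper, not part of this library; it relies only on the documented `on()` method, and the timeout value is up to you.

```javascript
// Hypothetical helper (not part of ms-bing-speech-service): resolves with the
// payload of the first `eventName` event, or rejects after `timeoutMs`.
// Only uses on(), so the listener stays attached; emissions after the promise
// has settled are ignored.
function waitForEvent(emitter, eventName, timeoutMs) {
  return new Promise((resolve, reject) => {
    let settled = false;

    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      reject(new Error(`timed out after ${timeoutMs} ms waiting for '${eventName}'`));
    }, timeoutMs);

    emitter.on(eventName, (payload) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve(payload);
    });
  });
}
```

Create the promise before calling `sendFile()` (for example `const turnEnded = waitForEvent(recognizer, 'turn.end', 30000);`) so an early 'turn.end' is not missed, then `await turnEnded` before `await recognizer.stop()`.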

And in the browser (a global window distribution is also available in the dist directory), use an ArrayBuffer instance in place of a file path:

import speechService from 'ms-bing-speech-service';

// myArrayBuffer is a placeholder for an ArrayBuffer holding your audio
const file = myArrayBuffer;

const options = {
  language: 'en-US',
  subscriptionKey: '<your Bing Speech API key>'
};

const recognizer = new speechService(options);

recognizer.start()
  .then(_ => {
    console.log('service started');

    recognizer.on('recognition', (e) => {
      if (e.RecognitionStatus === 'Success') console.log(e);
    });

    recognizer.sendFile(file);
  }).catch((error) => console.error('could not start service:', error));
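When the user picks a recording with an `<input type="file">` element, the selected `File` is a `Blob`, and the standard `Blob.prototype.arrayBuffer()` method produces the ArrayBuffer that `sendFile()` accepts in the browser. `sendBlob` below is a hypothetical helper sketched under that assumption, not part of this library.

```javascript
// Hypothetical helper: read a Blob/File (e.g. from <input type="file">) into an
// ArrayBuffer and pass it to recognizer.sendFile(), which accepts ArrayBuffers
// in the browser. Resolves with whatever sendFile() resolves with.
async function sendBlob(recognizer, blob) {
  const audioBuffer = await blob.arrayBuffer();
  return recognizer.sendFile(audioBuffer);
}
```

For example, `input.addEventListener('change', () => sendBlob(recognizer, input.files[0]))`, where `input` is your file input element.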

The examples above use your subscription key to request an access token from Microsoft's service.

In some instances you may not want to share your subscription key directly with your application. If you're creating an app with multiple users, you may want to issue access tokens from an external API so each user can connect to the speech service without exposing your subscription key.

To do this, replace "subscriptionKey" in the options object of the examples above with "accessToken" and pass in the token your API provides:

const options = {
  language: 'en-US',
  accessToken: '<your access token here>'
};
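One way to build that external API is a small server-side function that exchanges your subscription key for a token. This is a sketch under stated assumptions, not part of this library: it assumes Node 18+ (for the global `fetch`) and Microsoft's token endpoint (the same `issueToken` URL used in the Custom Speech example, where `westus` is only an example region), which takes the key in an `Ocp-Apim-Subscription-Key` header and returns the token as plain text. `fetchAccessToken` is a hypothetical name.

```javascript
// Hypothetical server-side helper: exchange a subscription key for a
// short-lived access token so the key never reaches the client.
// Assumes Node 18+ (global fetch); the region in the URL is an example.
const DEFAULT_ISSUE_TOKEN_URL =
  'https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken';

async function fetchAccessToken(
  subscriptionKey,
  issueTokenUrl = DEFAULT_ISSUE_TOKEN_URL,
  fetchImpl = globalThis.fetch // injectable for testing
) {
  const response = await fetchImpl(issueTokenUrl, {
    method: 'POST',
    headers: { 'Ocp-Apim-Subscription-Key': subscriptionKey }
  });

  if (!response.ok) {
    throw new Error(`token request failed with status ${response.status}`);
  }

  // The token service responds with the raw token as plain text.
  return response.text();
}
```

Your API would return that token to the client, which passes it as `accessToken`. Microsoft documents these tokens as short-lived (about ten minutes), so plan to request a fresh one for long sessions.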

Custom Speech Service

Yes! You can totally use this with Custom Speech Service. You'll need a few more details in your options object, though.

Your subscriptionKey will be the key displayed on your custom endpoint deployment page in the Custom Speech Management Portal. There, you can also find your websocket endpoint of choice to use.

The following code will get you up and running with the Custom Speech Service:

const speechService = require('ms-bing-speech-service');

const options = {
  subscriptionKey: '<your Custom Speech Service subscription key>',
  serviceUrl: 'wss://<your endpoint id>.api.cris.ai/speech/recognition/conversation/cognitiveservices/v1',
  issueTokenUrl: 'https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken'
};

const recognizer = new speechService(options);

recognizer
  .start()
  .then(_ => {
    recognizer.on('recognition', (e) => {
      if (e.RecognitionStatus === 'Success') console.log(e);
    });

    recognizer.sendFile('future-of-flying.wav');
  })
  .catch(console.error);
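If you deploy several Custom Speech endpoints, it can help to build the options object from its parts. `buildCustomSpeechOptions` is a hypothetical helper; the URL templates mirror the example above, and making the token region a parameter is an assumption based on the region-specific `issueToken` host.

```javascript
// Hypothetical helper: assemble the Custom Speech options shown above from an
// endpoint id and key. The 'conversation' path segment is copied from the
// example; the region parameter for the token URL is an assumption.
function buildCustomSpeechOptions({ subscriptionKey, endpointId, region = 'westus' }) {
  if (!subscriptionKey || !endpointId) {
    throw new Error('subscriptionKey and endpointId are required');
  }

  return {
    subscriptionKey,
    serviceUrl: `wss://${endpointId}.api.cris.ai/speech/recognition/conversation/cognitiveservices/v1`,
    issueTokenUrl: `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`
  };
}
```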

See the API Reference section below for details on configuration and methods.

API Reference

Methods

SpeechService(options)

  • options Object
  • Returns SpeechService

Creates a new instance of SpeechService.

const recognizer = new SpeechService(options);

Available options are below:

| name | type | description | default | required |
| --- | --- | --- | --- | --- |
| subscriptionKey | String | your Speech API key | n/a | yes, unless accessToken is supplied |
| accessToken | String | your Speech access token. Only required if the subscriptionKey option is not supplied. | n/a | no |
| language | String | the language of the speech you want to recognize. See supported languages in the official Microsoft Speech API docs. | 'en-US' | no |
| mode | String | which recognition mode you'd like to use. Choose from interactive, conversation, or dictation. | 'conversation' | no |
| format | String | the format you'd like recognition results returned in. Choose from simple or detailed. | 'simple' | no |
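To catch configuration mistakes before opening a connection, you could apply the defaults and constraints from this table yourself. `resolveOptions` is a hypothetical sketch, not something the library exports, and the library may validate differently.

```javascript
// Hypothetical helper (not part of the library): apply the documented defaults
// and check the documented constraints before constructing a recognizer.
const DEFAULTS = { language: 'en-US', mode: 'conversation', format: 'simple' };
const MODES = ['interactive', 'conversation', 'dictation'];
const FORMATS = ['simple', 'detailed'];

function resolveOptions(options = {}) {
  const resolved = { ...DEFAULTS, ...options };

  // One of the two credentials must be present.
  if (!resolved.subscriptionKey && !resolved.accessToken) {
    throw new Error('either subscriptionKey or accessToken is required');
  }
  if (!MODES.includes(resolved.mode)) {
    throw new Error(`mode must be one of: ${MODES.join(', ')}`);
  }
  if (!FORMATS.includes(resolved.format)) {
    throw new Error(`format must be one of: ${FORMATS.join(', ')}`);
  }
  return resolved;
}
```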

recognizer.start()

Connects to the Speech API websocket on your behalf. Returns a promise.

recognizer.start().then(() => {
 console.log('recognizer service started.');
}).catch(console.error);

recognizer.stop()

Disconnects from the established websocket connection to the Speech API. Returns a promise.

recognizer.stop().then(() => {
  console.log('recognizer service stopped.');
}).catch(console.error);
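Because `start()` and `stop()` both return promises, you can pair them so the connection is always closed, even when sending audio fails. `withRecognizer` is a hypothetical helper built on `try`/`finally`, not part of this library.

```javascript
// Hypothetical helper: start the recognizer, run `work`, and always stop the
// connection afterwards, whether `work` resolved or threw.
async function withRecognizer(recognizer, work) {
  await recognizer.start();
  try {
    return await work(recognizer);
  } finally {
    await recognizer.stop();
  }
}
```

Let `work` resolve only once you have every result you need (for example, after 'turn.end'); stopping as soon as `sendFile()` resolves may close the socket before the final results arrive.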

recognizer.sendStream(stream)

  • stream Readable Stream

Sends an audio payload stream to the Speech API websocket connection. The audio payload is a native NodeJS Buffer stream (e.g. a readable stream) or an ArrayBuffer in the browser. Returns a promise.

See the 'Sending Audio' section of the official Speech API docs for details on the data format needed.

NodeJS example:

const fs = require('fs');
const audioStream = fs.createReadStream('speech.wav');

// register the listener before sending so no results are missed
recognizer.on('recognition', (message) => {
  console.log('new recognition:', message);
});

recognizer.sendStream(audioStream).then(() => {
  console.log('stream sent.');
}).catch(console.error);
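If audio arrives in pieces over time (for example, from a recording library) rather than from a file, one option is to hand `sendStream()` a `PassThrough` stream and write chunks into it as they arrive. This sketch assumes `sendStream()` reads the stream incrementally and resolves once it ends; `openAudioSink` is a hypothetical name, and per the official 'Sending Audio' docs the first bytes you write should still be the WAV header.

```javascript
const { PassThrough } = require('stream');

// Hypothetical helper: returns a small "sink" for audio chunks produced over
// time, plus the promise returned by recognizer.sendStream().
function openAudioSink(recognizer) {
  const sink = new PassThrough();
  const sent = recognizer.sendStream(sink);

  return {
    write: (chunk) => sink.write(chunk), // forward one Buffer of audio
    end: () => sink.end(),               // signal that no more audio follows
    sent
  };
}
```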

recognizer.sendFile(filepath)

  • filepath String

Streams an audio file from disk to the Speech API websocket connection. Also accepts a NodeJS Buffer or browser ArrayBuffer. Returns a promise.

See the 'Sending Audio' section of the official Speech API docs for details on the data format needed for the audio file.

recognizer.sendFile('/path/to/audiofile.wav').then(() => {
  console.log('file sent.');
}).catch(console.error);

or, in the browser, with an ArrayBuffer:

fetch('speech.wav')
  .then((response) => response.arrayBuffer())
  .then((audioBuffer) => recognizer.sendFile(audioBuffer))
  .then(() => console.log('file sent.'))
  .catch((error) => console.error('something went wrong:', error));
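Before sending a file, it can be worth checking its WAV header against the format the service expects. The target values in `isSpeechApiFriendly` (PCM, mono, 16-bit, 16 kHz) are an assumption based on Microsoft's Speech API documentation; check the 'Sending Audio' section for the authoritative list. Both functions are hypothetical helpers, not part of this library.

```javascript
// Hypothetical helper: read the format fields from a WAV file's header.
// Walks the RIFF chunks to find 'fmt ' instead of assuming a fixed offset.
function readWavFormat(buffer) {
  if (buffer.toString('ascii', 0, 4) !== 'RIFF' ||
      buffer.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('not a RIFF/WAVE file');
  }

  let offset = 12;
  while (offset + 8 <= buffer.length) {
    const id = buffer.toString('ascii', offset, offset + 4);
    const size = buffer.readUInt32LE(offset + 4);
    if (id === 'fmt ') {
      return {
        audioFormat: buffer.readUInt16LE(offset + 8), // 1 = PCM
        channels: buffer.readUInt16LE(offset + 10),
        sampleRate: buffer.readUInt32LE(offset + 12),
        bitsPerSample: buffer.readUInt16LE(offset + 22)
      };
    }
    // Chunks are padded to an even number of bytes.
    offset += 8 + size + (size % 2);
  }
  throw new Error("no 'fmt ' chunk found");
}

// Assumed target format: 16 kHz, 16-bit, mono PCM.
function isSpeechApiFriendly(fmt) {
  return fmt.audioFormat === 1 && fmt.channels === 1 &&
    fmt.bitsPerSample === 16 && fmt.sampleRate === 16000;
}
```

In Node you might call `isSpeechApiFriendly(readWavFormat(fs.readFileSync('future-of-flying.wav')))` before `sendFile()`.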

Events

You can listen to the following events on the recognizer instance:

recognizer.on('recognition', callback)

  • callback Function

Event listener for incoming recognition message payloads from the Speech API. Message payload is a JSON object.

recognizer.on('recognition', (message) => {
  console.log('new recognition:', message);
});
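In practice you'll often want to handle the non-success statuses too. The status names below (Success, NoMatch, InitialSilenceTimeout, BabbleTimeout, EndOfDictation, Error) come from Microsoft's Speech API documentation rather than from this library, and `DisplayText` assumes the 'simple' result format. `describeRecognition` is a hypothetical helper.

```javascript
// Hypothetical helper: turn a recognition payload into a short log line.
// Status names follow Microsoft's Speech API docs (assumed, not from this library).
function describeRecognition(message) {
  switch (message.RecognitionStatus) {
    case 'Success':
      return `recognized: ${message.DisplayText}`;
    case 'NoMatch':
      return 'speech detected, but no words could be matched';
    case 'InitialSilenceTimeout':
      return 'only silence at the start of the audio';
    case 'BabbleTimeout':
      return 'only noise at the start of the audio';
    case 'EndOfDictation':
      return 'dictation ended';
    case 'Error':
      return 'the service reported an error';
    default:
      return `unknown status: ${message.RecognitionStatus}`;
  }
}
```

You could then log every result with `recognizer.on('recognition', (m) => console.log(describeRecognition(m)));`.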

recognizer.on('close', callback)

  • callback Function

Event listener for Speech API websocket connection closures.

recognizer.on('close', (error) => {
  console.log('Speech API connection closed');
  // you can optionally look for an error object (most closures currently report a 1006 even when intentional closure happens but we're looking into it!)
  console.log(error);
});

recognizer.on('error', callback)

  • callback Function

Event listener for incoming Speech API websocket connection errors.

recognizer.on('error', (error) => {
  console.log(error);
});

recognizer.on('turn.start', callback)

  • callback Function

Event listener for Speech API websocket 'turn.start' event. Fires when the service detects an audio stream.

recognizer.on('turn.start', () => {
  console.log('start turn has fired.');
});

recognizer.on('turn.end', callback)

  • callback Function

Event listener for Speech API websocket 'turn.end' event. Fires after the 'speech.endDetected' event, once the turn has ended. Listen for this event to be notified when an entire stream of audio has been processed and all results have been received.

recognizer.on('turn.end', () => {
  console.log('end turn has fired.');
});

recognizer.on('speech.startDetected', callback)

  • callback Function

Event listener for Speech API websocket 'speech.startDetected' event. Fires when the service has first detected speech in the audio stream.

recognizer.on('speech.startDetected', () => {
  console.log('speech startDetected has fired.');
});

recognizer.on('speech.endDetected', callback)

  • callback Function

Event listener for Speech API websocket 'speech.endDetected' event. Fires when the service has stopped being able to detect speech in the audio stream.

recognizer.on('speech.endDetected', () => {
  console.log('speech endDetected has fired.');
});

recognizer.on('speech.phrase', callback)

  • callback Function

Identical to the recognition event. Event listener for incoming recognition message payloads from the Speech API. Message payload is a JSON object.

recognizer.on('speech.phrase', (message) => {
  console.log('new phrase:', message);
});

recognizer.on('speech.hypothesis', callback)

  • callback Function

Event listener for Speech API websocket 'speech.hypothesis' event. Only fires when using interactive mode. Contains incomplete recognition results. This event will fire often - beware!

recognizer.on('speech.hypothesis', (message) => {
  console.log('new hypothesis:', message);
});

recognizer.on('speech.fragment', callback)

  • callback Function

Event listener for Speech API websocket 'speech.fragment' event. Only fires when using dictation mode. Contains incomplete recognition results. This event will fire often - beware!

recognizer.on('speech.fragment', (message) => {
  console.log('new fragment:', message);
});
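Because 'speech.hypothesis' and 'speech.fragment' fire so often, a common pattern is to keep only the latest partial text and replace it when the final result arrives. This sketch assumes partial payloads carry a `Text` field and final results a `DisplayText` field, as described in Microsoft's Speech API documentation; `createLiveTranscript` is a hypothetical helper.

```javascript
// Hypothetical helper: keeps a running transcript made of finished phrases plus
// the most recent partial result, which is overwritten rather than appended.
function createLiveTranscript(recognizer) {
  const finished = [];
  let partial = '';

  const onPartial = (message) => { partial = message.Text || ''; };
  recognizer.on('speech.hypothesis', onPartial); // interactive mode
  recognizer.on('speech.fragment', onPartial);   // dictation mode

  recognizer.on('recognition', (message) => {
    if (message.RecognitionStatus === 'Success') finished.push(message.DisplayText);
    partial = ''; // the final result supersedes any partial text
  });

  return {
    // Finished phrases followed by the current partial, if any.
    text: () => [...finished, partial].filter(Boolean).join(' ')
  };
}
```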

License

MIT.

Credits

Big thanks to @michael-chi. Their Bing Speech example was a great foundation to build upon, particularly the response parser and header helper.