speech-recorder is a cross-platform, native node.js addon for getting a stream of audio from a device's microphone. Using speech-recorder, you can also get only the audio that corresponds to someone speaking.
This module is used for speech recognition in Serenade. Serenade enables you to write code through natural speech, rather than typing.
speech-recorder has been tested on Windows 10, macOS 10.14+, and Ubuntu 18.04+ (and may work on other platforms as well).
To install speech-recorder, run:
yarn add speech-recorder
If you're using this library with Electron, you should probably use electron-rebuild.
This library uses two voice activity detection mechanisms: a fast first pass (the WebRTC VAD), and a slightly slower, but much more accurate, second pass (the Silero VAD). See below for the various options you can supply to each.
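The two-pass idea can be sketched in a few lines. This is an illustration only, not the library's internals: fastPass and accuratePass are hypothetical stand-ins for the WebRTC and Silero detectors, and the 0.75 cutoff simply mirrors the documented vadThreshold default.

```javascript
// Illustrative two-pass gate. `fastPass` and `accuratePass` are hypothetical
// stand-ins for the WebRTC VAD and the Silero VAD; they are not library APIs.
function makeTwoPassDetector({ fastPass, accuratePass, threshold = 0.75, disableSecondPass = false }) {
  return function isSpeech(samples) {
    // Cheap check first: most non-speech audio is rejected here.
    if (!fastPass(samples)) {
      return false;
    }
    if (disableSecondPass) {
      return true;
    }
    // Slower, more accurate check only on audio that survived the first pass.
    return accuratePass(samples) >= threshold;
  };
}

// Toy detectors so the sketch runs on its own.
const peak = (samples) => Math.max(...samples.map(Math.abs));
const detector = makeTwoPassDetector({
  fastPass: (samples) => peak(samples) > 0.1,
  accuratePass: (samples) => peak(samples), // pretend the peak is a probability
});

console.log(detector([0.01, 0.02])); // false: rejected by the first pass
console.log(detector([0.2, 0.5])); // false: second pass scores 0.5, below 0.75
console.log(detector([0.2, 0.9])); // true: both passes accept
```

The design point is that the expensive detector only runs on audio the cheap detector has already flagged, which keeps the common case (silence) fast.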
You can get a list of supported devices with:
import { getDevices } from "speech-recorder";
console.log(getDevices());
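To choose a specific microphone, you might search that list for a device and fall back to the default. The helper below is a sketch that runs on sample data rather than the real getDevices() output: the id field is documented, but the name field and the findDeviceId helper are assumptions made for illustration.

```javascript
// Pick a device id from a list shaped like the output of getDevices().
// The `id` field is documented; the `name` field is an assumption.
function findDeviceId(devices, nameFragment) {
  const needle = nameFragment.toLowerCase();
  const match = devices.find(
    (device) => typeof device.name === "string" && device.name.toLowerCase().includes(needle)
  );
  // -1 tells start() to use the default device.
  return match ? match.id : -1;
}

// Sample data standing in for getDevices(); real output will differ.
const sampleDevices = [
  { id: 0, name: "Built-in Microphone" },
  { id: 1, name: "USB Headset" },
];

console.log(findDeviceId(sampleDevices, "usb")); // 1
console.log(findDeviceId(sampleDevices, "bluetooth")); // -1
```

With the real library, the result could be passed along as recorder.start({ deviceId: findDeviceId(getDevices(), "usb") }).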
You can write all audio to a file with:
import fs from "fs";
import { SpeechRecorder } from "speech-recorder";

const recorder = new SpeechRecorder();
const writeStream = fs.createWriteStream("audio.raw");

recorder.start({
  // Called for every buffer read from the microphone.
  onAudio: (audio) => {
    writeStream.write(audio);
  }
});
Or, just the speech with:
import fs from "fs";
import { SpeechRecorder } from "speech-recorder";

const recorder = new SpeechRecorder({ framesPerBuffer: 320 });
const writeStream = fs.createWriteStream("audio.raw");

recorder.start({
  // The second argument is the speaking state (see the onAudio option).
  onAudio: (audio, speaking) => {
    if (speaking) {
      writeStream.write(audio);
    }
  }
});
The SpeechRecorder constructor supports the following options:
- disableSecondPass: whether or not to disable the second pass. defaults to false.
- error: callback called on audio stream error. defaults to null.
- framesPerBuffer: the number of audio frames to read at a time. defaults to 320.
- highWaterMark: the highWaterMark to be applied to the underlying stream, or how much audio can be buffered in memory. defaults to 64000 (64 kB).
- leadingPadding: the number of frames to buffer at the start of a speech chunk. this can prevent audio at the start of the file from getting cut off. defaults to 20.
- firstPassFilter: the level of aggressiveness for the first-pass filter on a scale of 0-3, with 0 being the least aggressive and 3 being the most aggressive. defaults to 3.
- minimumVolume: a minimum volume threshold for speech.
- speakingThreshold: the number of consecutive speaking frames before considering speech to have started. defaults to 1.
- silenceThreshold: the number of consecutive non-speaking buffers before considering speech to be finished. defaults to 10.
- triggers: a list of Trigger objects that can optionally specify when the onTrigger callback is executed.
- vadBufferSize: the number of buffers to pass to the second-pass VAD, i.e., the number of frames passed to the VAD is framesPerBuffer * vadBufferSize.
- vadThreshold: the probability cutoff, between 0 and 1, for the second-pass VAD. defaults to 0.75. e.g., a value of 0.9 will only consider a buffer to be speech if the VAD is 90% confident.
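To see how speakingThreshold and silenceThreshold interact, here is a small state machine that follows the descriptions above. It is a sketch of the documented behaviour under simple assumptions, not the library's actual code; makeSpeakingTracker and vadWindowFrames are hypothetical names, and the vadBufferSize value of 4 is arbitrary.

```javascript
// Illustrative state machine for speakingThreshold and silenceThreshold.
// This is a sketch of the documented behaviour, not the library's code.
function makeSpeakingTracker({ speakingThreshold = 1, silenceThreshold = 10 } = {}) {
  let speaking = false;
  let consecutiveSpeech = 0;
  let consecutiveSilence = 0;

  // Feed one per-buffer VAD decision; returns the speaking state afterwards.
  return function update(isSpeech) {
    if (isSpeech) {
      consecutiveSpeech += 1;
      consecutiveSilence = 0;
      if (!speaking && consecutiveSpeech >= speakingThreshold) {
        speaking = true;
      }
    } else {
      consecutiveSilence += 1;
      consecutiveSpeech = 0;
      if (speaking && consecutiveSilence >= silenceThreshold) {
        speaking = false;
      }
    }
    return speaking;
  };
}

// Number of frames the second-pass VAD sees, per the vadBufferSize description.
function vadWindowFrames({ framesPerBuffer = 320, vadBufferSize }) {
  return framesPerBuffer * vadBufferSize;
}

const update = makeSpeakingTracker({ speakingThreshold: 2, silenceThreshold: 3 });
const decisions = [true, true, false, false, true, false, false, false];
console.log(decisions.map((isSpeech) => update(isSpeech)));
// [ false, true, true, true, true, true, true, false ]

console.log(vadWindowFrames({ vadBufferSize: 4 })); // 1280
```

Note how the short pause in the middle does not end the speaking state: only three consecutive non-speech decisions do.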
The start method supports the following options:
- deviceId: id value from getDevices corresponding to the device you want to use; a value of -1 uses the default device.
- onAudio: a callback to be executed when audio data is received from the mic. will be passed (audio, speaking, speech, volume, silence), where audio is the buffer of audio data, speaking is whether or not we're in the speaking state, speech is whether the current frame is speech (recall that consecutive non-speaking frames must be found to exit the speaking state, so speaking and speech can be different), volume is the volume of the audio, and silence is the number of consecutive silence frames that have been heard.
- onChunkStart: a callback to be executed when a speech chunk starts. will be passed the leading buffer, whose size is determined by leadingPadding.
- onChunkEnd: a callback to be executed when a speech chunk ends.
- onTrigger: a callback to be executed when a trigger threshold is met.
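As one way to combine these callbacks, the sketch below gathers each speech chunk into a single Buffer. It assumes audio arrives as Node.js Buffers, that onChunkStart receives the leading buffer as its first argument, and that onChunkStart fires before the chunk's onAudio calls; makeChunkCollector is a hypothetical helper, and the calls at the bottom simulate the recorder so the example runs without a microphone.

```javascript
// Collects the audio of each speech chunk into a single Buffer.
// Assumes audio arrives as Node.js Buffers and that onChunkStart receives
// the leading buffer as its first argument.
function makeChunkCollector(onChunk) {
  let parts = null; // null while no chunk is in progress

  return {
    onChunkStart: (leadingBuffer) => {
      parts = leadingBuffer ? [leadingBuffer] : [];
    },
    onAudio: (audio, speaking) => {
      // Only keep audio while a chunk is open and we are in the speaking state.
      if (parts !== null && speaking) {
        parts.push(audio);
      }
    },
    onChunkEnd: () => {
      if (parts !== null) {
        onChunk(Buffer.concat(parts));
        parts = null;
      }
    },
  };
}

// Simulated callback sequence standing in for a real recorder.
const chunks = [];
const handlers = makeChunkCollector((chunk) => chunks.push(chunk));
handlers.onAudio(Buffer.from([9]), false); // ignored: no chunk in progress
handlers.onChunkStart(Buffer.from([1, 2]));
handlers.onAudio(Buffer.from([3, 4]), true);
handlers.onAudio(Buffer.from([5]), true);
handlers.onChunkEnd();

console.log(chunks.length, Array.from(chunks[0])); // 1 [ 1, 2, 3, 4, 5 ]
```

With the real library, the handlers could be passed straight through, for example recorder.start({ ...handlers }).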
See the examples/ directory for example usages.
- speech-recorder uses PortAudio for native microphone access.
- speech-recorder uses webrtcvad as a first-pass filter for voice detection.
- speech-recorder uses silero-vad as a second pass for detecting voice.
- speech-recorder is based on node-portaudio, which in turn is based on naudiodon.