Directing Web Speech API audio to a specific output device?
josephrocca opened this issue · 4 comments
Hello! Have there been any discussions around giving developers the ability to direct speech generated via the Web Speech API SpeechSynthesis
interface to a specific audio output? I've not been able to find any, and it seems like a fairly important feature.
I've criticized the current Web Speech API for being too tightly coupled to the microphone and the default speaker output.
I suggest the Web Speech WG work to plug into the existing audio sources and sinks in the platform through MediaStreamTrack (there's a precedent in Web Audio). Output selection would then fall out for free. E.g.
audioElement.srcObject = speechSynthesis.createMediaStreamDestination();
await audioElement.setSinkId((await navigator.mediaDevices.selectAudioOutput({deviceId})).deviceId);
speechSynthesis.speak(new SpeechSynthesisUtterance("Hello world!"));
Neither this specification nor Media Capture and Streams defines capture of devices other than microphone input.
The suggested code is currently impossible in Chromium, which refuses to support listing or capture of monitor devices on Linux (https://bugs.chromium.org/p/chromium/issues/detail?id=931749). I have filed multiple specification and implementation issues to support what this issue requests; in brief, see w3c/mediacapture-main#720.
To capture the output of speechSynthesis.speak() in Firefox on Linux you can filter for the monitor device; a sketch follows below.
To capture the output of speechSynthesis.speak() in Chromium, workarounds must be used; see https://github.com/guest271314/captureSystemAudio.
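A minimal sketch of the Firefox-on-Linux approach, assuming a PulseAudio monitor device is exposed through enumerateDevices() and that its label contains "Monitor" (the exact label is system-dependent):
// assumes an async context and a prior permission grant
// (device labels are empty strings until the user has granted capture permission)
const devices = await navigator.mediaDevices.enumerateDevices();
const monitor = devices.find(
({kind, label}) => kind === 'audioinput' && /monitor/i.test(label)
);
const stream = await navigator.mediaDevices.getUserMedia({
audio: {deviceId: {exact: monitor.deviceId}}
});
// everything routed to the system output, including speechSynthesis.speak(),
// is now on this MediaStreamTrack
const [track] = stream.getAudioTracks();
speechSynthesis.speak(new SpeechSynthesisUtterance('Hello world!'));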
Hello! Have there been any discussions around giving developers the ability to direct speech generated via the Web Speech API SpeechSynthesis interface to a specific audio output? I've not been able to find any, and it seems like a fairly important feature.
The Web Speech API does not define any speech synthesis algorithms, and neither Chromium nor Firefox ships with a speech synthesis engine. On Linux, the Web Speech API establishes a socket connection to Speech Dispatcher (speechd) https://github.com/brailcom/speechd. The Web Speech API does not currently specify any means to capture the audio output of speechSynthesis.speak().
Since the Web Speech API simply communicates with a locally installed speech synthesis engine, one approach is not to use the Web Speech API at all. Rather, install one or more speech synthesis engines locally and communicate with the engine directly. For example, the output of espeak-ng https://github.com/espeak-ng/espeak-ng is 1-channel WAV, so the STDOUT (raw binary data) of $ espeak-ng --stdout 'test' can be passed as a message to any origin, parsed to a Float32Array, and set as outputs in AudioWorkletProcessor.process(), where a MediaStream can be obtained for output using MediaStreamAudioDestinationNode. This is one working version of using Native Messaging with espeak-ng to capture speech synthesis output: https://github.com/guest271314/native-messaging-espeak-ng; I will update that repository to the version described above.
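For reference, the raw output described above can be produced directly from a shell (assuming espeak-ng is installed and, for the second line, ALSA's aplay):
$ espeak-ng --stdout 'test' > test.wav
$ espeak-ng --stdout 'test' | aplay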
@josephrocca There is no simple way to get the direct output of a speech synthesis engine other than calling the engine directly and processing the raw audio output. Technically, a socket connection can be established to speech-dispatcher (a sketch follows the help output below). No specification, including Media Capture and Streams, Audio Output Devices API, Web Audio API, or Web Speech API (see MediaStream, ArrayBuffer, Blob audio result from speak() for recording?, https://github.com/WebAudio/web-audio-api-v2/issues/10#issuecomment-682259080), defines a means to access or capture speech synthesis engine output directly.
$ speech-dispatcher -h
Speech Dispatcher -- Common interface for Speech Synthesis (GNU GPL)
Usage: speech-dispatcher [-{d|s}] [-l {1|2|3|4|5}] [-c com_method] [-S socket_path] [-p port] [-t timeout] | [-v] | [-h]
Options:
-d, --run-daemon Run as a daemon
-s, --run-single Run as single application
-a, --spawn Start only if autospawn is not disabled
-l, --log-level Set log level (between 1 and 5)
-L, --log-dir Set path to logging
-c, --communication-method
Communication method to use ('unix_socket'
or 'inet_socket')
-S, --socket-path Socket path to use for 'unix_socket' method
(filesystem path or 'default')
-p, --port Specify a port number for 'inet_socket' method
-t, --timeout Set time in seconds for the server to wait before it
shuts down, if it has no clients connected
-P, --pid-file Set path to pid file
-C, --config-dir Set path to configuration
-m, --module-dir Set path to modules
-v, --version Report version of this program
-D, --debug Output debugging information into $TMPDIR/speechd-debug
if TMPDIR is exported, otherwise to /tmp/speechd-debug
-h, --help Print this info
Please report bugs to speechd-discuss@nongnu.org
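As a hedged illustration of that socket connection: speechd speaks SSIP over a Unix socket, so a Node.js sketch along the following lines can drive it directly. The socket path, client name, and expected replies are assumptions drawn from the SSIP documentation; check your local speechd configuration:
const net = require('net');
// default socket location on systems that export XDG_RUNTIME_DIR;
// older setups may use ~/.cache/speech-dispatcher/speechd.sock
const socketPath = `${process.env.XDG_RUNTIME_DIR}/speech-dispatcher/speechd.sock`;
const client = net.connect(socketPath);
client.on('data', (data) => console.log(data.toString())); // SSIP status replies
client.write('SET SELF CLIENT_NAME "user:demo:main"\r\n');
client.write('SPEAK\r\n'); // expect "230 OK RECEIVING DATA"
client.write('Hello from speechd\r\n.\r\n'); // message text, terminated by a lone "."
// keep the connection open until the utterance has been spoken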
Aside from more elaborate solutions that involve growing WebAssembly.Memory (WebAudio/web-audio-api-v2#97) and streaming the monitor device from Nightly to Chromium (https://gist.github.com/guest271314/04a539c00926e15905b86d05138c113c), one solution is to use a local server. There are then ways to get the resulting MediaStreamTrack from localhost to any origin. Note that capturing the monitor device captures all system audio output, not only the speech-dispatcher speech synthesis module.
You can use any language for the server. Here we use php with the espeak-ng speech synthesis engine.
speak.php
<?php
if (isset($_POST["speak"])) {
header("Access-Control-Allow-Origin: http://localhost:8000");
header("Content-Type: application/octet-stream");
// escapeshellarg() quotes the text safely for the shell
$input = escapeshellarg(urldecode($_POST["speak"]));
$options = urldecode($_POST["options"]);
// passthru() writes the raw WAV bytes from espeak-ng straight to the response
passthru("espeak-ng --stdout " . $options . " " . $input);
exit();
}
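The PHP built-in web server is enough to serve it; the port is assumed to match the Access-Control-Allow-Origin header above:
$ php -S localhost:8000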
Using MediaStreamAudioSourceNode
// https://stackoverflow.com/a/35248852
function int16ToFloat32(inputArray) {
const output = new Float32Array(inputArray.length);
for (let i = 0; i < output.length; i++) {
const int = inputArray[i];
// If the high bit is on, then it is a negative number, and actually counts backwards.
const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
output[i] = float;
}
return output;
}
var fd = new FormData();
fd.append('options', '-v Storm');
fd.append('speak', `Now watch. Um, this how science works.
One researcher comes up with a result.
And that is not the truth. No, no.
A scientific emergent truth is not the
result of one experiment. What has to
happen is somebody else has to verify
it. Preferably a competitor. Preferably
someone who doesnt want you to be correct.
- Neil deGrasse Tyson, May 3, 2017 at 92nd Street Y`);
fetch('', {method:'post', body:fd})
.then(r => r.arrayBuffer())
.then(async arrayBuffer => {
const uint16 = new Uint16Array(arrayBuffer.slice(44)); // skip the 44-byte WAV header
const floats = int16ToFloat32(uint16);
const ac = new AudioContext({sampleRate: 22050});
const buffer = new AudioBuffer({
numberOfChannels: 1,
length: floats.length, // length is in sample-frames, not bytes
sampleRate: ac.sampleRate
});
console.log(floats);
buffer.getChannelData(0).set(floats);
const absn = new AudioBufferSourceNode(ac, {buffer});
// route through MediaStreamAudioDestinationNode to obtain a MediaStream (and its track)
const msd = new MediaStreamAudioDestinationNode(ac);
const {stream: mediaStream} = msd;
const source = new MediaStreamAudioSourceNode(ac, {mediaStream});
absn.connect(msd);
absn.start();
source.connect(ac.destination);
});
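Having a MediaStream rather than opaque speechSynthesis output is what makes the original question answerable: attach the stream to a media element and pick the sink. A sketch, assuming Audio Output Devices API support (setSinkId() and selectAudioOutput(), the latter requiring a user gesture):
const audio = new Audio();
audio.srcObject = mediaStream; // the stream obtained above
// prompt the user to choose an output device
const {deviceId} = await navigator.mediaDevices.selectAudioOutput();
await audio.setSinkId(deviceId);
audio.play();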
Using AudioWorkletNode, with a single Float32Array passed at construction. We could instead write the STDOUT stream to a single ArrayBuffer or SharedArrayBuffer using Response.body.getReader() and read from that memory in process(); a sketch of that streaming variant follows, while the full example below keeps the simpler buffered approach.
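A sketch of that streaming variant, reading the response body chunk by chunk before handing one contiguous buffer on (the endpoint name and concatenation strategy are illustrative; fd is the FormData built below):
const response = await fetch('speak.php', {method: 'post', body: fd});
const reader = response.body.getReader();
const chunks = [];
let byteLength = 0;
for (;;) {
const {done, value} = await reader.read();
if (done) break;
chunks.push(value);
byteLength += value.length;
}
// concatenate into one contiguous buffer, then slice past the 44-byte WAV header
const bytes = new Uint8Array(byteLength);
let offset = 0;
for (const chunk of chunks) {
bytes.set(chunk, offset);
offset += chunk.length;
}
const floats = int16ToFloat32(new Uint16Array(bytes.buffer.slice(44)));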
// https://stackoverflow.com/a/35248852
function int16ToFloat32(inputArray) {
const output = new Float32Array(inputArray.length);
for (let i = 0; i < output.length; i++) {
const int = inputArray[i];
// If the high bit is on, then it is a negative number, and actually counts backwards.
const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
output[i] = float;
}
return output;
}
var fd = new FormData();
fd.append('options', '-v Storm');
fd.append('speak', `Now watch. Um, this how science works.
One researcher comes up with a result.
And that is not the truth. No, no.
A scientific emergent truth is not the
result of one experiment. What has to
happen is somebody else has to verify
it. Preferably a competitor. Preferably
someone who doesnt want you to be correct.
- Neil deGrasse Tyson, May 3, 2017 at 92nd Street Y`);
fetch('', {method:'post', body:fd})
.then(r => r.arrayBuffer())
.then(async arrayBuffer => {
const uint16 = new Uint16Array(arrayBuffer.slice(44)); // skip the 44-byte WAV header
const floats = int16ToFloat32(uint16);
const ac = new AudioContext({sampleRate: 22050});
console.log(ac.state);
// stub so the processor class can be declared outside AudioWorkletGlobalScope
class AudioWorkletProcessor {}
class SpeechSynthesisStream extends AudioWorkletProcessor {
constructor(options) {
super(options);
Object.assign(this, options.processorOptions);
globalThis.console.log(this.floats);
this.port.postMessage({start:this.start = !this.start});
}
endOfStream() {
this.port.postMessage({
ended: true,
currentTime,
currentFrame,
readOffset: this.readOffset,
});
}
process(inputs, outputs) {
const [channel] = outputs.flat();
if (
this.readOffset >= this.floats.length
) {
console.log(this);
this.endOfStream();
return false;
}
const data = Float32Array.from({length: 128}, _ => {
// fill one 128-sample render quantum, zero-padding past the end of the data
const index = this.readOffset;
if (index >= this.floats.length) return 0;
return this.floats[this.readOffset++];
});
channel.set(data);
return true;
}
}
// register processor in AudioWorkletGlobalScope
function registerProcessor(name, processorCtor) {
return `${processorCtor};\nregisterProcessor('${name}', ${processorCtor.name});`;
}
const worklet = URL.createObjectURL(
new Blob(
[
registerProcessor(
'speech-synthesis-stream',
SpeechSynthesisStream
),
],
{ type: 'text/javascript' }
)
);
ac.onstatechange = e => console.log(ac.state);
await ac.audioWorklet.addModule(worklet);
const aw = new AudioWorkletNode(
ac,
'speech-synthesis-stream',
{
numberOfInputs: 1,
numberOfOutputs: 1,
channelCount: 1,
processorOptions: {
readOffset: 0,
ended: false,
start: false,
floats
},
}
);
aw.onprocessorerror = e => {
console.error(e);
console.trace();
};
const msd = new MediaStreamAudioDestinationNode(ac);
const { stream } = msd;
const [track] = stream.getAudioTracks();
aw.connect(msd);
aw.connect(ac.destination);
// const recorder = new MediaRecorder(stream);
// recorder.ondataavailable = e => console.log(URL.createObjectURL(e.data));
if (ac.state === 'running') {
await ac.suspend();
}
aw.port.onmessage = async e => {
console.log(e.data, ac.state);
if (
e.data.start &&
ac.state === 'suspended'
) {
await ac.resume();
// recorder.start();
} else {
// if (recorder.state === 'recording') {
//   recorder.stop();
// }
track.stop();
aw.disconnect();
msd.disconnect();
await ac.close();
console.log(track);
}
};
});