1. Record to disk using OfflineAudioContext
   - Simplest case??? (maybe not)
   - Accomplishes goals if we can also read from the file
2. Record to disk using separate stream processing server (Node, Elixir etc)
   - More similar to what CloudSpeech sees
   - Easier to send stuff to CloudSpeech
3. MediaRecorder API
   - It's supported by Chrome and Firefox
   - SEE https://developer.mozilla.org/en-US/docs/Web/API/MediaStream_Recording_API/Using_the_MediaStream_Recording_API

Let's go with (2) for the sake of simple disk writes and GoogleSpeech integration.
We can only interact with the WebAudio stream from a ScriptProcessorNode, which processes stuff in discrete (time-domain) chunks.
- Can't set sample rate of AudioContext. Any resampling should happen in an offline audio context.
- BUT, we can set sample rate of MICROPHONE using constraints. One way is to hook up a microphone with desired sample rate to the webaudio graph, set its gain to zero, and roll with it :)
- We might need to downmix stereo->mono with microphone input. It's probably easiest to do this within a ScriptProcessorNode, or on the writer side. ChannelMerger does not necessarily output mono, so be careful.
- Downmixing (should) also happen if we define a ScriptProcessorNode which only has one input channel :)
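
If we end up downmixing on the writer side, averaging the two channels is probably enough. A minimal sketch, assuming non-interleaved `Float32Array` channels of equal length (the shape `getChannelData()` hands back). `downmixToMono` is a hypothetical helper name, and plain averaging is one common choice, not necessarily what the browser does internally:

```javascript
// Hypothetical helper: average left and right into a single mono channel.
function downmixToMono(left, right) {
  if (left.length !== right.length) {
    throw new RangeError("channels must have the same number of sample frames");
  }
  const mono = new Float32Array(left.length);
  for (let i = 0; i < left.length; i++) {
    // Halving the sum keeps the result inside [-1, 1].
    mono[i] = 0.5 * (left[i] + right[i]);
  }
  return mono;
}

const left = new Float32Array([1, 0.5, -1]);
const right = new Float32Array([0, 0.5, -1]);
const mono = downmixToMono(left, right);
```

Averaging (rather than summing) means two full-scale channels cannot clip after the merge.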
SEE http://sox.sourceforge.net/AudioFormats.html
Playing mystery formats:
- We can play raw (headerless) audio with `sox`
- Use `play` instead of `sox -d` because it's easy to overwrite files!
  `play -r 44100 -e floating-point -b 32 -c 1 -t raw ./1487623877259`
- If we set 2 channels where there is really 1, it sounds octave-doubled: each channel gets only every other sample, so playback runs at twice the speed.
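
The octave doubling falls out of the arithmetic: a headerless file has a fixed byte count, so claiming twice the channels halves the number of sample frames. A rough sketch of that calculation (`rawDurationSeconds` is a hypothetical helper, and the 10-second file is a made-up example):

```javascript
// Hypothetical helper: how long a headerless file plays under an assumed format.
function rawDurationSeconds(byteLength, { sampleRate, bitsPerSample, channels }) {
  const bytesPerFrame = channels * (bitsPerSample / 8);
  return byteLength / bytesPerFrame / sampleRate;
}

// Made-up example: 10 seconds of mono 32-bit float audio at 44.1 kHz.
const byteLength = 10 * 44100 * 4;
const asMono = rawDurationSeconds(byteLength, { sampleRate: 44100, bitsPerSample: 32, channels: 1 });
const asStereo = rawDurationSeconds(byteLength, { sampleRate: 44100, bitsPerSample: 32, channels: 2 });
```

Here `asStereo` comes out at half of `asMono`: same bytes, half the playing time, one octave up.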
Writing WAV headers to raw audio of known format:
- Instead of writing wav headers by hand/using crappy node modules, let's use `sox`:
  `sox -r 44100 -e floating-point -b 32 -c 1 -t raw 1487626939167.raw 1487626939167.wav`
- You can also specify other params such as endian order...
  `play --channels=1 --bits=16 --rate=16000 --encoding=signed-integer --endian=little audio.raw`
Transforming audio with `sox`:
- From WAV, specify new format after input filename. Replace `-d` with output file name.
  - Resample: `sox bstheme-44k-f32.wav -r 8000 -e floating-point -b 32 -c 1 -d`
  - Reformat: `sox bstheme-44k-f32.wav -r 44100 -e unsigned-integer -b 8 -c 1 -d`
- 24-bit FLAC is supported by CloudSpeech streaming. Specifying un/signed/integer/float in the encoder causes trouble.
  - `sox bstheme-44k-f32.wav -r 44100 -b 24 -c 1 -C 8 bstheme-44k-i24-c8.flac`
- LPCM16 WAV files are just raw LPCM16 audio data prepended with a header. Add/strip this header as necessary.
  - Find out bitrate, sampling rate etc. of stream
  - Write WAV header
  - Write data...
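
If we ever do write the header by hand, the steps above fit in a few lines. A minimal sketch, assuming the canonical 44-byte header for integer PCM (float WAV needs a different format tag and is not covered here). `buildWavHeader` and its parameter names are hypothetical:

```javascript
// Hypothetical helper: build the 44-byte canonical WAV header for integer PCM.
function buildWavHeader({ sampleRate, numChannels, bitsPerSample, dataBytes }) {
  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  const blockAlign = numChannels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;             // bytes per second

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true);  // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // size of the fmt chunk body
  view.setUint16(20, 1, true);              // audio format 1 = integer PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);      // size of the sample data that follows
  return new Uint8Array(header);
}

// One second of 16 kHz mono LPCM16 is 32000 bytes of sample data.
const wavHeader = buildWavHeader({
  sampleRate: 16000, numChannels: 1, bitsPerSample: 16, dataBytes: 32000,
});
```

All multi-byte fields are little-endian (the `true` argument), which matches the LPCM16 data that follows. Stripping the header is the reverse: skip the first 44 bytes, assuming the file has no extra chunks.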
- FLAC stream compression supports LPCM16 (FLAC does not support floats)
  - https://www.npmjs.com/package/emflac
  - https://github.com/Rillke/flac.js
  - Various libflac bindings if we're native...
  - (decoder only) https://www.npmjs.com/package/flac.js
- AudioBuffer
  - `AudioBuffer.getChannelData()` returns a `Float32Array`
  - Also has properties `sampleRate`, `length` (in sample frames), `duration` (in seconds), `numberOfChannels`
  - Easiest TypedArray conversion is via e.g. `Uint16Array.from(new Float32Array([values]))`. But this casts each sample's numeric value (truncating and wrapping) instead of rescaling the range, which is not what we want.
  - Amplitude range of Float32 audio data is [-1,1]
  - Map [-1,1] -> [0...65535]. Is WebAudio [-1,1] or [-1,1)?
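
What we want instead is an explicit rescale. A minimal sketch, assuming signed 16-bit output (LINEAR16 in the encoding table is signed; for the unsigned [0...65535] mapping, add 32768 to each result). `floatTo16BitPcm` is a hypothetical helper, and the asymmetric scaling is one common convention that sidesteps the [-1,1] vs [-1,1) question:

```javascript
// Hypothetical helper: rescale Float32 samples in [-1, 1] to signed 16-bit PCM.
function floatTo16BitPcm(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Asymmetric scaling: -1 maps to -32768 and +1 maps to 32767.
    out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return out;
}

const floats = new Float32Array([-1, 0, 1, 2]);
const naive = Uint16Array.from(floats); // numeric cast: -1 wraps to 65535, 1 stays 1
const pcm = floatTo16BitPcm(floats);    // rescaled and clamped
```

The `naive` array shows the problem with the plain cast: full-scale positive audio ends up as the integer 1, which is nearly silent.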
- Streaming over websockets: See http://blog.mgechev.com/2015/02/06/parsing-binary-protocol-data-javascript-typedarrays-blobs/
Audio input can be captured by an application’s microphone or sent from a pre-recorded audio file. Multiple audio encodings are "supported," including FLAC, AMR, PCMU and Linear-16. See Google dox.
Encoding | Support | Notes |
---|---|---|
`ENCODING_UNSPECIFIED` | Not specified. Will return result `google.rpc.Code.INVALID_ARGUMENT`. | |
`LINEAR16` | Uncompressed 16-bit signed little-endian samples (Linear PCM). This is the only encoding that may be used by `AsyncRecognize`. | Can it be compressed by e.g. gzip? |
`FLAC` | This is the recommended encoding for `SyncRecognize` and `StreamingRecognize` because it uses lossless compression. 16-bit and 24-bit samples are supported. Not all fields in STREAMINFO are supported. | Clearly `AsyncRecognize` is not `StreamingRecognize`! |
`MULAW` | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. | |
`AMR` | Adaptive Multi-Rate Narrowband codec. `sample_rate` must be 8000 Hz. | |
`AMR_WB` | Adaptive Multi-Rate Wideband codec. `sample_rate` must be 16000 Hz. | |
SEE Encoding
SEE dox
`AsyncRecognize`: "Long running operation", probably most useful for offline transcription of long audio pieces (limited to 80 min; sync/stream limited to 1 min). Pass base64-encoded raw audio data OR a file stored in GoogleCloudStorage.

`SyncRecognize`: Functionally similar to Async, but supports more encodings + less audio time.

`StreamingRecognize`: Stream audio + receive streamed transcription. This is the one we want.
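
For the non-streaming calls that take base64 audio, the encoding step itself is small in Node. A rough sketch, assuming the samples are already LINEAR16 in an `Int16Array` and that we run on a little-endian machine (typed arrays use platform byte order). `pcmToBase64` is a hypothetical helper, and the request body around it is not shown:

```javascript
// Hypothetical helper: base64-encode LINEAR16 samples for a request body.
function pcmToBase64(int16Samples) {
  // View the same bytes without copying; honours byteOffset for subarrays.
  const bytes = Buffer.from(
    int16Samples.buffer,
    int16Samples.byteOffset,
    int16Samples.byteLength
  );
  return bytes.toString("base64");
}

const encoded = pcmToBase64(new Int16Array([0, 1, -1]));
const decoded = Buffer.from(encoded, "base64");
```

Round-tripping through `Buffer.from(encoded, "base64")` gives back the same 6 bytes, which is a cheap sanity check before sending anything.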
Websockets/node stream adapter
Node wrappers around `sox` for easy transcoding