Whisper Edge

Porting OpenAI Whisper speech recognition to edge devices with hardware ML accelerators, enabling always-on live voice transcription. Current work includes Jetson Nano and Coral Edge TPU.

Jetson Nano

Shopping cart

Part                                                  Price (2023)
NVIDIA Jetson Nano Developer Kit (4GB)                $149.00
ChanGeek CGS-M1 USB Microphone                        $16.99
Noctua NF-A4x10 5V Fan (or similar, recommended)      $13.95
D-Link DWA-181 Wi-Fi Adapter (or similar, optional)   $21.94

Model

The base.en version of Whisper seems to work best for the Jetson Nano:

  • base is the largest model size that fits into the 4GB of memory without modification.
  • Inference performance with base is ~10x real-time in isolation and ~1x real-time while recording concurrently.
  • Using the English-only .en version further lowers the word error rate (<5% WER on LibriSpeech test-clean).
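
As a quick way to try the model choice outside the streaming setup, here is a minimal one-off transcription sketch using the openai-whisper Python package (test.wav is assumed to be a short 16kHz recording, such as the one made in the Troubleshooting section below):

import whisper

# Load the English-only base model; the weights download is ~139MB and
# fits into the Nano's 4GB of memory.
model = whisper.load_model("base.en")

# Transcribe a short recording and print the recognized text.
result = model.transcribe("test.wav", language="en")
print(result["text"])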

Hack

Dilemma:

  • Whisper and some of its dependencies require Python 3.8 or newer.
  • The latest version of JetPack that supports the Jetson Nano is 4.6.3, which ships Python 3.6.
  • There is no easy way to upgrade to Python 3.8 without losing CUDA support in PyTorch.

Workaround:

  • Build a custom NVIDIA Docker container that provides a compatible Python environment with CUDA-enabled PyTorch (see Setup below).
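
To confirm the container actually delivers both pieces, a quick check from inside it (a sketch; the exact Python version reported depends on the container image):

import sys
import torch

# Expect a Python version new enough for Whisper, despite 3.6 on the host.
print(sys.version)

# Expect True: PyTorch built with CUDA support for the Nano's GPU.
print(torch.cuda.is_available())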

Setup

First, follow the developer kit setup instructions, connect the Wi-Fi adapter and the microphone to USB, and ideally install a fan. (Plugging in an Ethernet cable also speeds up the downloads.) Then, get a shell on the Jetson Nano:

ssh user@jetson-nano.local

We will use NVIDIA Docker containers to run inference. Get the source code and build the custom container:

git clone https://github.com/maxbbraun/whisper-edge.git
bash whisper-edge/build.sh

Run

Launch inference:

bash whisper-edge/run.sh

You should see console output similar to this:

I0317 00:42:23.979984 547488051216 stream.py:75] Loading model "base.en"...
100%|#######################################| 139M/139M [00:30<00:00, 4.71MiB/s]
I0317 00:43:14.232425 547488051216 stream.py:79] Warming model up...
I0317 00:43:55.164070 547488051216 stream.py:86] Starting stream...
I0317 00:44:19.775566 547488051216 stream.py:51]
I0317 00:44:22.046195 547488051216 stream.py:51] Open AI's mission is to ensure that artificial general intelligence
I0317 00:44:31.353919 547488051216 stream.py:51] benefits all of humanity.
I0317 00:44:49.219501 547488051216 stream.py:51]

The stream.py script running in the container accepts flags for different configurations:

bash whisper-edge/run.sh --help

       USAGE: stream.py [flags]
flags:

stream.py:
  --channel_index: The index of the channel to use for transcription.
    (default: '0')
    (an integer)
  --chunk_seconds: The length in seconds of each recorded chunk of audio.
    (default: '10')
    (an integer)
  --input_device: The input device used to record audio.
    (default: 'plughw:2,0')
  --language: The language to use or empty to auto-detect.
    (default: 'en')
  --latency: The latency of the recording stream.
    (default: 'low')
  --model_name: The version of the OpenAI Whisper model to use.
    (default: 'base.en')
  --num_channels: The number of channels of the recorded audio.
    (default: '1')
    (an integer)
  --sample_rate: The sample rate of the recorded audio.
    (default: '16000')
    (an integer)

Try --helpfull to get a list of all flags.
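
For example, to reduce latency with a smaller model and shorter chunks (both are documented flags above; tiny.en is another Whisper model size):

bash whisper-edge/run.sh --model_name tiny.en --chunk_seconds 5

For illustration, the chunked approach behind these flags can be approximated in a few lines of Python. This is only a sketch, not the actual stream.py: it substitutes the sounddevice package for the script's ALSA-based recording and hard-codes the default flag values.

import sounddevice as sd
import whisper

CHUNK_SECONDS = 10   # mirrors --chunk_seconds
SAMPLE_RATE = 16000  # mirrors --sample_rate; Whisper expects 16kHz audio

model = whisper.load_model("base.en")  # mirrors --model_name

while True:
    # Record one mono chunk of audio (mirrors --num_channels).
    audio = sd.rec(CHUNK_SECONDS * SAMPLE_RATE, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()

    # Transcribe the chunk and print the text, as in the log output above.
    text = model.transcribe(audio.flatten(), language="en")["text"]
    print(text)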

Troubleshooting

To see if the microphone is working properly, use alsa-utils:

sudo apt-get -y install alsa-utils

# Is the USB device connected?
lsusb

# Is the correct recording device selected?
arecord -l

# Is the gain set properly?
alsamixer

# Does a test recording work?
arecord --format=S16_LE --duration=5 --rate=16000 --channels=1 --device=plughw:2,0 test.wav
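
Alternatively, the audio devices can be inspected from Python (a sketch using the sounddevice package, which is not part of this repo):

import sounddevice as sd

# List all audio devices with their indices and channel counts; the USB
# microphone should show up as an input device.
print(sd.query_devices())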

Coral Edge TPU

See the corresponding issue for a discussion of what supporting the Google Coral Edge TPU might look like.