/rhasspy3

An open source voice assistant toolkit for many human languages

Primary LanguagePythonMIT LicenseMIT

Rhasspy 3

NOTE: This is a very early developer preview!

An open source toolkit for building voice assistants.

Voice assistant pipeline

Rhasspy focuses on:

  • Privacy - no data leaves your computer unless you want it to
  • Broad language support - more than just English
  • Customization - everything can be changed

Getting Started

Missing Pieces

This is a developer preview, so there are lots of things missing:

  • A user friendly web UI
  • An automated method for installing programs/services and downloading models
  • Support for custom speech to text grammars
  • Intent systems besides Home Assistant
  • The ability to accumulate context within a pipeline

Core Concepts

Domains

Rhasspy is organized by domain:

  • mic - audio input
  • wake - wake word detection
  • asr - speech to text
  • vad - voice activity detection
  • intent - intent recognition from text
  • handle - intent or text input handling
  • tts - text to speech
  • snd - audio output

Programs

Rhasspy talks to external programs using the Wyoming protocol. You can add your own programs by implementing the protocol or using an adapter.

Adapters

Small scripts that live in bin/ and bridge existing programs into the Wyoming protocol.

For example, a speech to text program (asr) that accepts a WAV file and outputs text can use asr_adapter_wav2text.py

Pipelines

Complete voice loop from microphone input (mic) to speaker output (snd). Stages are:

  1. detect (optional)
    • Wait until wake word is detected in mic
  2. transcribe
    • Listen until vad detects silence, then convert audio to text
  3. recognize (optional)
    • Recognize an intent from text
  4. handle
    • Handle an intent or text, producing a text response
  5. speak
    • Convert handle output text to speech, and speak through snd

Servers

Some programs take a while to load, so it's best to leave them running as a server. Use bin/server_run.py or add --server <domain> <name> when running the HTTP server.

See servers section of configuration.yaml file.


Supported Programs


HTTP API

http://localhost:13331/<endpoint>

Unless overridden, the pipeline named "default" is used.

  • /pipeline/run
    • Runs a full pipeline from mic to snd
    • Produces JSON
    • Override pipeline or:
      • wake_program
      • asr_program
      • intent_program
      • handle_program
      • tts_program
      • snd_program
    • Skip stages with start_after
      • wake - skip detection, body is detection name (text)
      • asr - skip recording, body is transcript (text) or WAV audio
      • intent - skip recognition, body is intent/not-recognized event (JSON)
      • handle - skip handling, body is handle/not-handled event (JSON)
      • tts - skip synthesis, body is WAV audio
    • Stop early with stop_after
      • wake - only detection
      • asr - detection and transcription
      • intent - detection, transcription, recognition
      • handle - detection, transcription, recognition, handling
      • tts - detection, transcription, recognition, handling, synthesis
  • /wake/detect
    • Detect wake word in WAV input
    • Produces JSON
    • Override wake_program or pipeline
  • /asr/transcribe
    • Transcribe audio from WAV input
    • Produces JSON
    • Override asr_program or pipeline
  • /intent/recognize
    • Recognizes intent from text body (POST) or text (GET)
    • Produces JSON
    • Override intent_program or pipeline
  • /handle/handle
    • Handles intent/text from body (POST) or input (GET)
    • Content-Type must be application/json for intent input
    • Override handle_program or pipeline
  • /tts/synthesize
    • Synthesizes audio from text body (POST) or text (GET)
    • Produces WAV audio
    • Override tts_program or pipeline
  • /tts/speak
    • Plays audio from text body (POST) or text (GET)
    • Produces JSON
    • Override tts_program, snd_program, or pipeline
  • /snd/play
    • Plays WAV audio via snd
    • Override snd_program or pipeline
  • /config
    • Returns JSON config
  • /version
    • Returns version info

WebSocket API

ws://localhost:13331/<endpoint>

Audio streams are raw PCM in binary messages.

Use the rate, width, and channels parameters for sample rate (hertz), width (bytes), and channel count. By default, input audio is 16Khz 16-bit mono, and output audio is 22Khz 16-bit mono.

The client can "end" the audio stream by sending an empty binary message.

  • /pipeline/asr-tts
    • Run pipeline from asr (stream in) to tts (stream out)
    • Produces JSON messages as events happen
    • Override pipeline or:
      • asr_program
      • vad_program
      • handle_program
      • tts_program
    • Use in_rate, in_width, in_channels for audio input format
    • Use out_rate, out_width, out_channels for audio output format
  • /wake/detect
    • Detect wake word from websocket audio stream
    • Produces a JSON message when audio stream ends
    • Override wake_program or pipeline
  • /asr/transcribe
    • Transcribe a websocket audio stream
    • Produces a JSON message when audio stream ends
    • Override asr_program or pipeline
  • /snd/play
    • Play a websocket audio stream
    • Produces a JSON message when audio stream ends
    • Override snd_program or pipeline