/KALO-ESP32-Voice-Assistant

Code snippets showing how to record I2S audio and store as .wav file on ESP32 with SD card, how to transcribe pre-recorded audio via Deepgram SpeechToText API, how to generate audio from text via TextToSpeech API from OpenAI a/o Google TTS. Triggering ESP32 actions via Voice.

Primary LanguageC++

Summary

Code snippets showing how to record I2S audio and store as .wav file on ESP32 with SD card, how to transcribe pre-recorded audio via STT (SpeechToText) Deepgram API, how to generate audio from text via TTS (TextToSpeech) API from OpenAI a/o Google TTS. Triggering ESP32 actions via Voice.

The repository contains the Demo main sketch 'KALO_ESP32_Voice_Assistant.ino', demonstrating different use case of my libraries 'lib_audio_recording.ino' and 'lib_audio_transcription.ino'

Features

Explore the demo use case examples (1-6) in main sketch, summary:

  • Recording and playing audio are working offline, online connection needed for STT, TTS and streaming services
  • Recording Voice Audio with variable length (recording as long a button is pressed), storing as .wav file (with 44 byte header) on SD card
  • Replay your recorded audio (using Schreibfaul1 <audio.h> library)
  • Playing Audio streams (e.g. playing music via radio streams with <audio.h> library)
  • STT (SpeechToText), using Deepgram API service (registration needed)
  • TTS (TextToSpeech), supporting multilingual 6 voices via Open AI API (registration needed)
  • TTS (TextToSpeech), using Google TTS API (no registration needed)
  • Triggering ESP actions via voice (e.g. triggering GPIO LED pins, addressing dedicated voices by calling their name, playing music on request)

Hardware

  • ESP32 development board (e.g. ESP32-WROOM-32), connected to Wifi
  • I2S digital microphone, e.g. INMP441 [I2S pins 22, 33, 35]
  • I2S audio amplifier, e.g. MAX98357A [I2S pins 25,26,27] with speaker
  • Micro SD Card [VSPI Default pins 5,18,19,23]
  • RGB LED (status indicator) and Analog Poti (audio volume)

Installation & Customizing

  • Required: Arduino IDE with ESP32 libray 3.0.x (based on ESP-IDF 5.1). Older 2.x ESP framework fail because new I2S driver missed
  • Required (for playing Audio on ESP32): AUDIO.H library ESP32-audioI2S.zip. Install latest zip (3.0.11g from July 18, 2024 or newer)
  • Copy all 3 .ino files of 'KALO-ESP32-Voice-Assistant' into same folder (it is one sketch, split into 3 Arduino IDE tabs)
  • Update your pin assignments in the header of all 3 .ino files
  • Insert your credentials (ssid, password, OpenAI API key, Deepgram API key)
  • Define your favorite recording settings (SAMPLE_RATE, BITS_PER_SAMPLE, GAIN_BOOSTER_I2S) in lib_audio_recording.ino header
  • Define your language settings (Google TTS in KALO_ESP32_Voice_Assistant.ino, Deepgram STT in lib_audio_transcription.ino header)
  • Toggle DEBUG flag to true (displaying Serial.print details) or false (for final usage)

Known issues

  • WifiClientSecure connection not reliable (assuming RAM heap issue in WifiClientSecure.h library), rarely freezing (e.g. after 10 mins)

Updates

  • 2024-07-22: Misc. enhancements, STT connection reliablility improved further, code cleaned up
  • 2024-07-14: Updated version:
    • WifiClientSecure connection reliablility improved (still not perfect)
    • STT Deepgram response faster (new total response time average on e.g. 5 sec voice record: ~ 2.5 sec). Recommendation: It's worth trying 8Khz/8bit once, STT response ~1 sec (Note: Using complete sentences instead of single words improves recognition quality)
    • user language settings (STT & TTS) added, bug fixing etc.
  • 2024-07-08: First drop, already working, not finally cleaned up (just posted this drop on some folks request)

Next steps

  • Code cleanup, regular updates .. ongoing
  • Review & improve reliability of WifiClientSecure connection .. ongoing
  • Fixing 'Play 8bit audio' issue - Done (2024-07-18), latest AUDIO.H (since 2024-07-18) supports 8bit wav format
  • Adding more use case examples in main sketch
  • Including SpeechGen.IO TTS API call (hundreds of additional voices). Coded already, unfortunaltly failed since ESP 3.x framework update
  • Including a OpenAI API library with demo code, using an ESP32 as Voice ChatGPT device

. . .

Demo Videos

Short video clip, presenting Recording & SpeechToText & TextToSpeech (without Open AI, ESP32 is not 'answering', just parroting my voice). Workflow:

  • Recording user voice, storing audio .wav file (8KHz/8bit) to SD card,
  • STT: transcribe pre-recorded voice via Deepgram API,
  • TTS: repeat spoken sentence with Goggle TTS voice (a/o triggering e.g. LED via voice):

Video Screenshot


Featured video from other users & friends:
@techiesms: using my Deepgram transcription STT library in his IoT projects:
https://www.youtube.com/watch?v=j0EEFXmikvk-