A fast, fully local AI Voicechat using WebSockets
Demo video: voicechat2.webm (unmute to hear the audio)
On a 7900-class AMD RDNA3 card, voice-to-voice latency is in the 1 second range:
- Whisper large-v2 (Q5)
- Llama 3 8B (Q4_K_M)
- tts_models/en/vctk/vits (Coqui TTS default VITS models)
On a 4090, using Faster Whisper with faster-distil-whisper-large-v2, we can cut the latency down to as low as 300ms:
Demo video: voicechat2-fw.webm
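If you want to try the Faster Whisper path, here is a minimal sketch of pulling in the distil model via the faster-whisper Python package (the model name, device, and compute type below are example values, not necessarily what voicechat2 uses internally):
pip install faster-whisper
# one-off check that the distil model loads on a CUDA GPU (weights download on first run)
python -c "from faster_whisper import WhisperModel; WhisperModel('distil-large-v2', device='cuda', compute_type='float16')"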
These installation instructions are for Ubuntu LTS and assume you've already set up ROCm or CUDA.
I recommend using conda or (my preference) mamba for environment management; it will make your life easier.
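If you don't have mamba yet, a minimal sketch using the Miniforge installer (the standard conda-forge installer for x86_64 Linux; adjust for your platform):
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b
# make conda/mamba available in new shells
~/miniforge3/bin/conda init bash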
sudo apt update
# Not strictly required, but helpers we use
sudo apt install byobu curl wget
# Audio processing
sudo apt install espeak-ng ffmpeg libopus0 libopus-dev
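Optionally, you can sanity-check that the audio tooling is in place and that ffmpeg sees an Opus encoder:
espeak-ng --version
ffmpeg -hide_banner -encoders | grep -i opus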
# Create env
mamba create -y -n voicechat2 python=3.11
# Setup
mamba activate voicechat2
git clone https://github.com/lhl/voicechat2
cd voicechat2
pip install -r requirements.txt
# Build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
# AMD version
# -DGGML_HIP_UMA=ON to work with APUs (but hurts dGPU perf)
GGML_HIPBLAS=1 make -j
# Nvidia version
GGML_CUDA=1 make -j
# Get model - large-v2 is 3094 MB
bash ./models/download-ggml-model.sh large-v2
# Quantized version - large-v2-q5_0 is 1080MB
# bash ./models/download-ggml-model.sh large-v2-q5_0
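Optionally, smoke-test the build and model with the sample clip that ships with whisper.cpp (a sketch; in newer CMake-based builds the binary may be named whisper-cli instead of main):
./main -m models/ggml-large-v2.bin -f samples/jfk.wav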
# Go back to the voicechat2 directory before the next step
cd ..
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# AMD version
make GGML_HIPBLAS=1 -j
# Nvidia version
make GGML_CUDA=1 -j
# Grab your preferred GGUF model
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
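To sanity-check the build and model, you can serve it with llama.cpp's built-in server; this is just a sketch (the binary is ./llama-server in recent builds, ./server in older ones, and the port, context size, and offload values are example settings, not necessarily what run-voicechat2.sh uses):
# offload all layers to the GPU and expose an OpenAI-compatible endpoint on port 8080
./llama-server -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 -c 4096 --host 127.0.0.1 --port 8080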
# Go back to the voicechat2 directory before the next step
cd ..
mamba activate voicechat2
pip install TTS
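Optionally, smoke-test Coqui TTS with the default VITS/VCTK model mentioned above (a sketch; the speaker ID is just an example, use --list_speaker_idxs to see the available ones):
tts --model_name "tts_models/en/vctk/vits" --speaker_idx p225 --text "Hello from voicechat2" --out_path /tmp/tts-test.wav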
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install -r requirements.txt
pip install phonemizer
# Download the LJSpeech model (a LibriTTS model is also available):
# https://huggingface.co/yl4579/StyleTTS2-LJSpeech/tree/main
# https://huggingface.co/yl4579/StyleTTS2-LibriTTS/tree/main
pip install huggingface_hub
huggingface-cli download --local-dir . yl4579/StyleTTS2-LJSpeech
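The download should leave the config and checkpoint under Models/LJSpeech/ (layout as published in the HF repo; adjust if it changes):
ls Models/LJSpeech/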
Some extra convenience scripts for launching:
- run-voicechat2.sh - on your GPU machine, tries to launch all servers in separate byobu sessions
- remote-tunnel.sh - connects your GPU machine to a jump machine
- local-tunnel.sh - connects to the GPU machine via a jump machine
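For reference, the tunnel scripts boil down to standard SSH port forwarding; a minimal sketch (the port number and host names here are placeholders, not the scripts' actual values):
# forward the voicechat2 web port from the GPU machine to your local machine via a jump host
ssh -N -L 8000:gpu-machine:8000 user@jump-host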
The demo shows a fair amount of latency (~10s), and this project isn't the closest to what voicechat2 is doing, since it uses WebRTC rather than WebSockets (HF Transformers, Ollama)
A console-based local client (HF Transformers, Ollama, Coqui TTS, PortAudio)
This is a very responsive console-based local-client app that also has VAD and interruption support, plus a really clever hook! (whisper.cpp, llama.cpp, piper, espeak)
Another console-based local client, more of a proof of concept but with a blog writeup.
- https://github.com/vndee/local-talking-llm
- https://blog.duy.dev/build-your-own-voice-assistant-and-run-it-locally/
- MIT
Another console-based local client (FastConformer, HF Transformers, StyleTTS2, espeak)
KoljaB has a number of interesting projects around console-based local clients like RealtimeSTT, RealtimeTTS, Linguflex, etc. (faster_whisper, llama.cpp, Coqui XTTS)
- https://github.com/KoljaB/LocalAIVoiceChat
- NC (Coqui Model License)
This is not a local voicechat client, but it does have a neat WebRTC front-end, so it might be worth poking around in (Vite/React, Tailwind, Radix)