Based on Rudrabha/Wav2Lip and wrapped in js for Silly Tavern by Mozer
An extension that makes video messages with lipsync to audio from TTS. Silero TTS and Coqui XTTSv2 are supported.
New video of real time usage in Silly Tavern with STT and XTTSv2 in English.
And one more video: in Russian, contains some profanity.
Video guide on how to set everything up, in English, by MustacheAI.
Works with input videos and images. Please note that for static images only the lips will be animated. Real videos as input are more realistic. Anime pics/vids do not look as good, and sometimes the face is not detected.
Low-res real vids are the best in terms of realism and performance. I suggest using 300x400, 10-60 second long, 25 fps input videos.
Don't put 1080p vids in input as they can cause OOM errors. Automatic resizing is not done yet (TODO). Resize and cut vids manually.
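Until automatic resizing exists, the manual resize and cut can be scripted around ffmpeg, which is already a requirement. The sketch below is only an illustration: `build_resize_command` and `resize_video` are hypothetical helper names, and the defaults simply follow the sizes suggested above.

```python
import subprocess
from pathlib import Path


def build_resize_command(src, dst, width=300, fps=25, max_seconds=60):
    """Build an ffmpeg command that trims, downsizes and re-times a clip.

    Defaults follow the README's suggestion (about 300x400, 25 fps,
    up to 60 seconds); they are not values read from the extension.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", str(src),
        "-t", str(max_seconds),    # keep only the first max_seconds seconds
        # scale to the target width; -2 keeps the aspect ratio and an even height
        "-vf", f"scale={width}:-2,fps={fps}",
        str(dst),
    ]


def resize_video(src, dst, **kwargs):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_resize_command(src, dst, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_resize_command(Path("big_1080p.mp4"), Path("small.mp4"))))
```

Scaling by width with `-2` for the height avoids stretching the face, so a 16:9 source will not come out as exactly 300x400; crop it first if you need that exact size.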
Original Rudrabha/Wav2Lip model was built for low res vids and is fast. There are other lipsync models like Wav2Lip GAN or Wav2LipHD or SadTalkerVideo but they are slower.
- 2024.01.01 - bug fixes, paths for linux
- 2023.12.24 - faster-whisper STT (speech recognition) is now supported (CPU and GPU are both fast)
- 2023.12.23 - XTTSv2 is now supported, it has amazing TTS quality
- 2023.12.23 - Settings are now in GUI, added experimental live mode to mimic live video streams
- 2023.11.22 - CPU inference is also very fast with caching! (1 second for a short answer, 15 seconds for 11 second long input audio)
- 2023.11.21 - Caching for face detection. Generation speed for cached vids is now almost 2x faster (2 seconds for a short answer, 10 seconds for 11 second long input audio)
- CPU with 10+ GB RAM or an NVIDIA GPU with 8+ GB VRAM. If you have a Radeon GPU, please use the CPU; it is also fast and is turned on by default.
- latest Silly Tavern staging branch (https://github.com/SillyTavern/SillyTavern/tree/staging)
- latest Silly Tavern Extras (https://github.com/SillyTavern/SillyTavern-Extras)
- ffmpeg should be installed and put into your PATH environment (https://phoenixnap.com/kb/ffmpeg-windows)
- Works nicely on CPUs and nvidia GPUs. AMD GPUs are not tested. You can try (ROCm for linux?), they might work. Please report if they do.
- If you don't have much VRAM please use CPU (turned on by default). Min VRAM: 6 GB for 300x400 input video and short audio. Static input images may require less VRAM (how much?). Hi-res input videos/images and longer audios require more VRAM. Please report if you are able to run it with less VRAM
- If your LLM is also in VRAM, it can cause an OOM error or result in slower replies if you have shared VRAM.
- I tested wav2lip running on a GPU (3060 12GB) and was able to have ruGPT3.5-13B-gptq fully loaded into VRAM using autoGPTQ. But sometimes, with longer replies (4+ sentences), it spilled into shared VRAM, which caused a drastic drop in video generation speed.
- The default Silero UI in ST supports only English and does not support prosody (voice speed and pitch). This can now be fixed with my patch.
- By default CPU is used for inference. You can change it in extension settings.
- Video generation takes some time (about 5-10 s). If you use a GPU and your LLM is also in VRAM, don't ask it anything during video generation, or you may get an OOM error.
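The ffmpeg requirement listed above can be checked before the first run. This is a minimal sketch using only the Python standard library; `find_ffmpeg` and `ffmpeg_version_line` are hypothetical helper names, not part of the extension.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path to ffmpeg if it is on PATH, else None."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line(ffmpeg_path):
    """Return the first line printed by `ffmpeg -version` (empty if none)."""
    result = subprocess.run(
        [ffmpeg_path, "-version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg was not found on PATH; install it and add it to PATH.")
    else:
        print(f"Found ffmpeg at {path}")
        print(ffmpeg_version_line(path))
```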
Generation has two steps: face detection and lip movement. Face detection results are always the same for the same input video, so they can be cached. Face detection is much slower on CPU (10 s video = 60 s of face detection) than on GPU (7 s), but once all vids have cached face detection, the CPU alone is almost as fast as a GPU (1 second for a short answer). The cache is created automatically the first time an input video is used. Cached generation is ~2x faster. The CPU here is a Ryzen 7 7730U; the GPU is an NVIDIA RTX 3060 12 GB.
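The caching idea can be sketched in a few lines. This is not the extension's actual cache code: `cached_face_boxes` and `cache_key` are hypothetical names, `detect_faces` is a stand-in for the slow detection step, and the choice of a JSON file keyed by file name, size and modification time is an assumption made for illustration.

```python
import json
import os
from pathlib import Path


def cache_key(video_path):
    """Identify a video by name, size and modification time."""
    stat = os.stat(video_path)
    return f"{Path(video_path).name}_{stat.st_size}_{int(stat.st_mtime)}"


def cached_face_boxes(video_path, detect_faces, cache_dir):
    """Return face boxes for a video, running detect_faces only on a cache miss.

    detect_faces is a hypothetical callable: path -> list of [x1, y1, x2, y2]
    boxes, one per frame. It stands in for the slow face detection step.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (cache_key(video_path) + ".json")

    if cache_file.exists():
        # Cache hit: skip detection entirely.
        return json.loads(cache_file.read_text())

    boxes = detect_faces(video_path)
    cache_file.write_text(json.dumps(boxes))
    return boxes
```

Because the key includes size and modification time, replacing an input video with a different file of the same name triggers a fresh detection instead of reusing stale boxes.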
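Inference time for a 300x400, 10 s, 25 fps input video, with nothing else in VRAM.

| device | audio, s | gen, s | face_det | VRAM, GB |
|---|---|---|---|---|
| CPU | 1 | 55 | not cached | |
| CPU | 1 | 1 | cached | |
| CPU | 11 | 15 | cached | |
| CPU | 120 | 140 | cached | |
| GPU | 1 | 8 | not cached | 6 |
| GPU | 1 | 2 | cached | 6 |
| GPU | 11 | 15 | cached | 8 |
| GPU | 31 | 32 | not cached | 11.1 |
| GPU | 44 | 103 | not cached | 13.2 (used shared VRAM) |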
Just face detection:

| device | input_video, s | face_det, s |
|---|---|---|
| CPU | 10 | 55 |
| CPU | 30 | 435 (448x300 25 fps) |
| GPU | 30 | 13 |
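To compare the face detection rows, it can help to express them as seconds of detection per second of input video. The snippet below is a small worked calculation on the three measurements above; `detection_cost` is a hypothetical helper name.

```python
# (device, input video length in s, face detection time in s),
# copied from the face detection measurements above
MEASUREMENTS = [
    ("CPU", 10, 55),
    ("CPU", 30, 435),   # 448x300, 25 fps
    ("GPU", 30, 13),
]


def detection_cost(video_s, face_det_s):
    """Seconds of face detection needed per second of input video."""
    return face_det_s / video_s


for device, video_s, face_det_s in MEASUREMENTS:
    cost = detection_cost(video_s, face_det_s)
    print(f"{device}: {cost:.2f} s of detection per second of video")
```

On these numbers, uncached CPU detection costs 5.5 to 14.5 seconds per second of video, while the GPU needs less than half a second, which is why the one-time cache matters most for CPU users.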
- Launch and Open Silly Tavern in browser -> Extensions (at top menu) -> Install extension, paste and save:
https://github.com/Mozer/wav2lip_extension
- Activate conda environment if you are using it. Type in command line:
cd SillyTavern-extras\modules\
git clone https://github.com/Mozer/wav2lip
cd wav2lip
pip install -r requirements.txt
Wait while all the dependencies are installed. If there are errors, fix them manually or open an issue.
- Double click
\SillyTavern\public\scripts\extensions\third-party\wav2lip_extension\patch_silly_tavern.py
to patch some original Silly Tavern files. Backups are saved as .bkp files. To restore them, run restore_silly_tavern.py
- Double click
\SillyTavern-Extras\modules\wav2lip\patch_silly_tavern_extras.py
to patch some original Silly Tavern Extras files. To restore them, run restore_silly_tavern_extras.py
- Restart Silly Tavern Extras if it was running. You are now good to go with Silero TTS, which is fast. However, I recommend Coqui XTTSv2 (multilingual): it is just a bit slower but much more realistic.
- Follow the official guide on how to install and run XTTSv2 in Silly Tavern staging with conda: https://docs.sillytavern.app/extras/extensions/xtts/
- Note: it can also be installed without conda, and without downgrading python and pytorch, simply install the full version of Visual C++ Build Tools. I'm running everything in Python 3.11.5, pytorch 2.1.2+cu121
- To run xtts server you should use this command if you have nvidia card (2 seconds for an average voice message):
python -m xtts_api_server -d=cuda --deepspeed --lowvram --output c:\\SillyTavern-Extras\\
- To run on CPU (20 seconds for a short voice message) please use command:
python -m xtts_api_server -d=cpu --output c:\\SillyTavern-Extras\\
Note: wav2lip doesn't work with xtts --streaming-mode-improve param as it doesn't save audio to file (but is fast).
- Replace c:\\SillyTavern-Extras\\ with the full path to your SillyTavern-Extras folder; it is needed to pass the xtts audio file to Wav2lip. The full command can be put into a .bat file, so you won't need to type it every time.
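As an alternative to a .bat file, the two xtts commands above could be assembled by a small Python launcher. This is only a sketch: `build_xtts_command` is a hypothetical helper, and the flags are copied from the commands in this README rather than checked against the xtts_api_server documentation.

```python
import os
import subprocess
import sys


def build_xtts_command(extras_dir, device="cuda"):
    """Build the xtts_api_server command from the README for cuda or cpu."""
    if device not in ("cuda", "cpu"):
        raise ValueError("device must be 'cuda' or 'cpu'")
    # sys.executable runs the module with the same Python that runs this script.
    cmd = [sys.executable, "-m", "xtts_api_server", f"-d={device}"]
    if device == "cuda":
        # The README's GPU command adds these two flags.
        cmd += ["--deepspeed", "--lowvram"]
    # Keep a trailing path separator, as in the README's command.
    cmd += ["--output", os.path.join(extras_dir, "")]
    return cmd


if __name__ == "__main__":
    # Hypothetical location; replace with your own SillyTavern-Extras folder.
    command = build_xtts_command(r"c:\SillyTavern-Extras", device="cpu")
    print(command)
    # Uncomment to actually start the server:
    # subprocess.run(command, check=True)
```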
- Install the official extension 'Extension-Speech-Recognition': Silly Tavern -> Extensions -> Download Extensions and Assets -> connect button -> yes -> Speech Recognition -> download button
It has built-in streaming support for openai/whisper, but it does not work well: it skips a lot of words, does not work with the Russian language, and runs on a GPU.
SYSTRAN/faster-whisper is much faster and can be run on a CPU. I am using GUI for faster-whisper from https://github.com/reriiasu/speech-to-text
- Open a cmd in the directory where you want it to be installed and run:
git clone https://github.com/reriiasu/speech-to-text
cd speech-to-text
pip install -r requirements.txt
- By default it launches a web GUI on port 8000, the same port used by SillyTavern, so you need to change it in
\speech-to-text\speech_to_text\__main__.py
from
eel.start("index.html", size=(1024, 1024), close_callback=on_close)
to
eel.start("index.html", size=(1024, 1024), close_callback=on_close, port=8080)
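If you prefer not to edit the file by hand, the same one-line change can be applied with a short script. This is a sketch under the assumption that `__main__.py` contains the `eel.start` call exactly as quoted above; `with_port` and `patch_file` are hypothetical helpers, and the `.bkp` backup only mirrors the convention used by this extension's own patch scripts.

```python
from pathlib import Path

# The call as it appears in the unpatched file, according to this README.
OLD_CALL = 'eel.start("index.html", size=(1024, 1024), close_callback=on_close)'


def with_port(source_text, port=8080):
    """Return source_text with a port argument added to the eel.start call."""
    if OLD_CALL not in source_text:
        raise ValueError("eel.start call not found; file may already be patched")
    new_call = OLD_CALL[:-1] + f", port={port})"
    return source_text.replace(OLD_CALL, new_call, 1)


def patch_file(main_py, port=8080):
    """Patch __main__.py in place, keeping a .bkp copy of the original."""
    main_py = Path(main_py)
    original = main_py.read_text(encoding="utf-8")
    patched = with_port(original, port)
    main_py.with_name(main_py.name + ".bkp").write_text(original, encoding="utf-8")
    main_py.write_text(patched, encoding="utf-8")
```

Running `patch_file` a second time raises `ValueError` instead of silently rewriting the file, because the original call is no longer present.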
- Now run it with the following command. For convenience, create a .bat file with the contents:
python -m speech_to_text
- It will open a web GUI. Change following settings: App settings - check "Use Websocket server", uncheck "Create Audio File", set "Silence limit" to 20
- Model settings - select Model size "small", set Device to "cpu", set "Compute type" to float32, set Number of workers from 4 to 8 (how many cpu cores you want to use, i prefer 8)
- Transcribe settings - select language, e.g. "russian", task - "transcribe"
- Now you can run the STT server with wss support: click Start Transcription. If you want it to run faster, try a CUDA GPU, but the CPU is also fast (it takes ~2 seconds to transcribe).
- To speed up VAD a little: in
\speech-to-text\speech_to_text\utils\audio_utils.py
change
CHUNK = 512
to
CHUNK = 256
- Double click
\SillyTavern\public\scripts\extensions\third-party\wav2lip_extension\patch_streaming_stt.py
to patch 2 files (index.js and streaming.js) in \Extension-Speech-Recognition\
- Silly Tavern GUI -> Extensions -> Speech Recognition -> select "streaming" as the Speech-to-text Provider, set your language, and set the desired "Message mode" (I prefer "Auto send"). You are good to go.
- If SillyTavern loses its connection to the STT server, you can switch the Provider to "none" and back to "streaming" to reconnect the wss connection (or just hit F5).
- Enable wav2lip and silero-tts modules for silly extras and start it using command line or conda. (silero-tts module is optional, you can try xttsv2 in Silly, streaming-stt is also optional)
python server.py --enable-modules silero-tts,wav2lip,streaming-stt
- Enable wav2lip in web interface: Extensions -> Wav2lip -> Enabled.
- Make sure Silly Tavern is "Connected to API" of extras server. Make sure TTS is enabled in extensions and settings.
- Make sure voice is selected for current character or default character. Turn on TTS auto generation if you also want video auto generation.
- Make sure the language of the character in the dialogue is the same as the language in "silero_api_server\tts.py" and "SillyTavern-Extras\server.py", e.g. Russian TTS won't play English words and vice versa.
- Put your short (~10-30s) and low-res (~300x400) input vids/pics into
\SillyTavern-Extras\modules\wav2lip\input\default\
They will be played in random order. A face must be visible in every frame (e.g. not covered by a hand), or generation will fail with an error.
- DONE. Now in chat you can click the video camera icon near any message to generate a video response, or turn on automatic video generation in Extension settings.
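The "random order" behaviour for input files can be pictured with the sketch below. It is not the extension's code: `list_inputs` and `pick_input` are hypothetical helpers, the set of accepted extensions is an assumption, and skipping silence.mp4 (used by live stream mode, described further down) is also an assumption.

```python
import random
from pathlib import Path

# Assumed set of accepted extensions; the extension's real list may differ.
MEDIA_EXTENSIONS = {".mp4", ".jpg", ".jpeg", ".png"}


def list_inputs(char_dir):
    """Return input clips/pictures in a character folder, sorted by name."""
    return sorted(
        p for p in Path(char_dir).iterdir()
        if p.is_file()
        and p.suffix.lower() in MEDIA_EXTENSIONS
        and p.name != "silence.mp4"   # assumed to be reserved for live stream mode
    )


def pick_input(char_dir, rng=random):
    """Pick one input at random; raise if the folder has no usable files."""
    candidates = list_inputs(char_dir)
    if not candidates:
        raise FileNotFoundError(f"no input videos or pictures in {char_dir}")
    return rng.choice(candidates)
```

Passing a seeded `random.Random` as `rng` makes the choice reproducible, which is handy when checking a folder of new clips one by one.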
Mode:
There are two modes for this extension: 'video message' and 'live stream' (you can switch them in extension settings).
- video message - character will send you a video message in chat
- live stream - mimics a live video stream like Twitch. First a regular video message is played. Then a silence video is played right after the character finishes talking. You need to make this video manually: find or create a video where your character is not speaking; longer videos look better. Put silence.mp4 into
\SillyTavern-Extras\modules\wav2lip\input\default\
Change default to your char name if needed.
Char folder: You can organize your char vids into folders in \SillyTavern-Extras\modules\wav2lip\input\ and then switch between them in SillyTavern Extension Settings.
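Before switching to live stream mode, it may be useful to see which character folders already contain a silence.mp4. The sketch below only inspects the folder layout described above; `describe_char_folders` is a hypothetical helper and not part of the extension.

```python
from pathlib import Path


def describe_char_folders(input_root):
    """Map each character folder name to whether it contains silence.mp4."""
    return {
        folder.name: (folder / "silence.mp4").is_file()
        for folder in sorted(Path(input_root).iterdir())
        if folder.is_dir()
    }
```

Call it with the full path to your `wav2lip\input\` folder; any folder mapped to `False` still needs a silence video before live stream mode can use it.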
- User setting to limit input audio length to prevent OOM (optional input)
- Resize input vids/pics automatically (optional checkbox)
- Disable sending a message to LLM while video is generating (optional checkbox in settings)
If you have bugs or proposals please open a bug report or a pull request
discord: https://discord.gg/DZnCnGsJ