TextboxSTT

A speech-to-text application that uses OpenAI's Whisper via faster-whisper to transcribe audio and send the result to VRChat's textbox system and/or KillFrenzyAvatarText over OSC. Also supports OBS via Browser Source and a SteamVR overlay!

Note

This program is designed to be completely free of charge, open source, and independent of cloud-based transcription services such as Microsoft Azure. It accomplishes this by using transcription algorithms that run on your own hardware, which protects your privacy, reduces latency, and improves reliability. As a result, I will not be incorporating any cloud-based transcription or translation services into this program.

Discord Support Server

🢃 Download Latest Release

Features

Sending transcription to either:
- VRChat's in-game Textbox, allowing for use with any avatar.
- KillFrenzyAvatarText (KAT), which needs to be integrated into an avatar.
  - You can use Frosty704's Billboard to add a speech bubble to your avatar.
  - Support for up to 80 emotes!
  - Automatic Detection of KAT on an avatar. It will use KAT if available, otherwise fall back to VRChat Textbox.
- OBS over Browser Source!
- Websockets
SteamVR Overlay for seeing your transcription without having to look at your own textbox in-game.
Fast and Efficient. VRCTextboxSTT uses ctranslate2 as the runtime for transcription and translation, which makes it incredibly efficient and fast.
Uses Steam Input: press to transcribe, hold to clear/cancel (A/X by default). Also works on desktop, with the "F1" key by default.
Customizable
- You can bind the button to start transcription to any action that SteamVR allows you to set.
- You can bind it to any key on your keyboard.
- Many Timing settings for transcription delays and button presses.
- Multiple different Transcription modes to choose from.
- You can change all of the Audio feedback sounds to a sound of your liking.
Ability to use fine-tuned models from Hugging Face.
Automatic launch with SteamVR.
Text to Text for quick typing.
Simple API. The latest transcription is available at the "/transcript" endpoint. (Requires OBS Source to be turned on)
Audio feedback for each step in the transcription.
- Volume for each of the feedbacks can be modified over the Settings menu.
Multi-language support. Whisper supports around 100 different languages.
- Translate to and from those languages. (Powered by M2M100)
Word Replacements and Emote Replacements with Regular Expressions.
Free to use under the GPL-3.0 license.
Completely free of Subscription/Cloud Services, by running locally on your hardware.
Runs completely offline, besides downloading models/dependencies and updates/update checks
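
To illustrate the word replacement feature mentioned above, here is a minimal sketch of regex-based replacements in Python. The table contents, the function name, and the case-insensitive matching are assumptions made for illustration; the program's actual settings format and matching rules may differ.

```python
import re

# Hypothetical replacement table: regex pattern -> replacement text.
# These entries are examples only, not the program's defaults.
REPLACEMENTS = {
    r"\bvr chat\b": "VRChat",
    r"\bsmiley face\b": ":)",
}


def apply_replacements(text, replacements=REPLACEMENTS):
    """Apply every pattern in insertion order, ignoring case."""
    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(apply_replacements("I love VR Chat smiley face"))
```

The `\b` word boundaries keep a pattern from matching inside a longer word, which is usually what you want when correcting words the transcription tends to get wrong.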

Limitations

Limited character availability
- VRChat's Textbox is limited to showing 144 characters at a time.
- KillFrenzyAvatarText supports ASCII characters and a certain set of Japanese hiragana, and is limited to showing 128 characters at a time.
Visibility
- VRChat's Textbox is only visible to friends by default. Consider telling people they can change that in VRChat's settings.
- VRChat's Textbox is not visible in Streamer Mode.
- KillFrenzyAvatarText is only visible to shown avatars and is PC only, as it uses a custom shader setup.
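
The character limits above mean that a long transcription has to be shortened or split before it is shown. As a rough sketch (not necessarily how this program handles overflow), the following Python snippet splits text into chunks that fit either display. The constant and function names are hypothetical.

```python
import textwrap

# Limits taken from the list above.
VRCHAT_TEXTBOX_LIMIT = 144
KAT_LIMIT = 128


def split_for_display(text, limit=VRCHAT_TEXTBOX_LIMIT):
    """Split text into chunks of at most `limit` characters.

    textwrap.wrap breaks on whitespace where possible and only
    splits a word when the word itself is longer than `limit`.
    """
    return textwrap.wrap(text, width=limit)


if __name__ == "__main__":
    for chunk in split_for_display("word " * 60, limit=KAT_LIMIT):
        print(len(chunk), chunk[:20] + "...")
```

Each chunk could then be shown one after another, which keeps every word readable instead of cutting the text off at the limit.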

Requirements

With default settings, this program has the following requirements:

.NET 4.8.1 (Should be preinstalled on Windows 10 and up)
Visual C++ 2015-2022 Redistributable (x64)
SteamVR (if run in VR; no Oculus/Meta support as of now.)
Inference on GPU (Recommended):
- CUDA-enabled GPU (so NVIDIA only, but you can try your luck with something like ZLUDA for AMD GPUs). Without one, the program falls back to using the CPU.
- ~11GB of available space for installation, ~6GB of space used after successful installation and loading models.
- ~1GB of available RAM.
- ~320MB of available VRAM.
Inference on CPU:
- ~4GB of available space for installation, ~2GB of space used after successful installation and loading models.
- ~400MB of available RAM.

Note

Depending on the settings you change in the program, these requirements can change drastically.
VRAM usage per model (int8 precision, English models only):
~200MB with tiny.en
~220MB with base.en
~320MB with distil-small.en
~380MB with small.en
~580MB with distil-medium.en
~900MB with medium.en
~900MB with distil-large-v2
~1.6GB with large-v2
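
If you are unsure which model your GPU can hold, the figures above can be turned into a quick lookup. This is only a sketch: the numbers are the approximate int8 values from the list (with ~1.6GB written as 1600MB), and the function name is made up for this example.

```python
# Approximate int8 VRAM usage in MB, copied from the list above.
VRAM_MB = {
    "tiny.en": 200,
    "base.en": 220,
    "distil-small.en": 320,
    "small.en": 380,
    "distil-medium.en": 580,
    "medium.en": 900,
    "distil-large-v2": 900,
    "large-v2": 1600,
}


def models_that_fit(budget_mb):
    """Return the models whose listed usage fits the budget, smallest first."""
    by_size = sorted(VRAM_MB.items(), key=lambda item: item[1])
    return [name for name, usage in by_size if usage <= budget_mb]


if __name__ == "__main__":
    print(models_that_fit(400))
```

Leave some headroom when picking a budget, since VRChat and SteamVR use VRAM at the same time.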

Demo

Frosty704 using VRCTextboxSTT and KillFrenzyAvatarText with their Billboard project. More on that in their repository.

Documentation (In Progress)

Backlog

Similar Projects

There are similar projects that you might want to consider using:

RabidCrab's STT incurs a monetary cost and relies on cloud-based transcription services, which inherently tend to be slower and less reliable compared to local transcription methods.
VRCWizard's TTS-Voice-Wizard employs a wide array of transcription methods, encompassing both local and cloud-based approaches. Furthermore, it offers support similar to that of KAT, as seen in this project. Beyond functioning solely as a Speech To Text program, TTS-Voice-Wizard boasts a range of additional, noteworthy features. It may be worth your while to explore this tool further.
yum-food's TaSTT is spiritually and philosophically very close to this project. They have a very feature-rich avatar text solution that supports more characters than KAT does. They have made great progress on this problem, so definitely take a look at it!

Support this Project

You can always leave a GitHub Star 🟊 (It's free) or buy me a coffee:

Credit

OpenAI for their amazing work with anything really.
SYSTRAN/faster-whisper and ctranslate2, their work makes this project much more efficient and faster than it otherwise would be.
ValveSoftware/openvr and cmbruns/pyopenvr
Uberi/speech_recognition and jleb/pyaudio
killfrenzy96 for KillFrenzyAvatarText and KatOSC
Frosty704's Billboard for making this project more useful.
cyberkitsune's OSCQuery implementation, because I was too lazy to do that myself xD

I5UCC/VRCTextboxSTT