possibly relevant
NotBrianZach opened this issue · 5 comments
Great project!
- The VAD WhisperX uses is superior to the simple level detection I wrote. But they can work together, obviously. Perhaps it can prevent Whisper from interpreting background noise as "you" all the time. I'll have to try it.
- The timestamps I can't use for realtime dictation. Their examples look cool flashing by as video captions, but I'm trying to keep this project unobtrusive and running in the background as much as possible.
- Speaker diarization and timestamps are better suited for making captions, which I don't need here. But if there is a pressing demand for something like a separate server, GUI, or "app for that", I might just code it... Are you looking for something to do real-time video captions of your Zoom meetings? What. Anyway, let me know.
- The speedup is not that much for my application.
- The command-line option for downloading models suitable to recognize other languages is nice. The way we do it is by editing the Python file. But yes, that could be an option!
- Memory requirements could be higher. We're trying to keep them down.
- Not at all fast with my video card.
- Their code won't run; it has a lot of errors with the ffmpeg module that need fixing.
- It constantly phones the internet. This project is supposed to work offline.
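For anyone wondering what "simple level detection" means in the list above, here is a minimal sketch. It is not the project's actual code: the function names and the threshold of 500 are assumptions made up for illustration, and the samples are assumed to be signed 16-bit PCM values.

```python
import math


def rms_level(samples):
    """Root-mean-square amplitude of a chunk of signed 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_loud_enough(samples, threshold=500.0):
    """True when the chunk's RMS level exceeds the threshold.

    The default threshold is a hypothetical value; a real setup would
    tune it to the microphone and the room.
    """
    return rms_level(samples) > threshold
```

A check like this only measures loudness, so it can't tell speech from a loud fan. That is the gap a real VAD would fill, and the two could be chained: level detection first as a cheap gate, then the VAD.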
fair, just a heads up I wasn't sure, this sh** moves so quick.
I just want to talk to my big computer with a GPU (probably running some LLaMA derivative, if I get it set up in the near future) over a Bluetooth speaker/mic, e.g. bone conduction headphones. Preferably from anywhere, maybe by connecting through my phone's Wi-Fi hotspot or something. (Not a feature request, just describing what I intend to do.)
Or at least that's the near-term goal; no idea what I could use it for in the future lol
(E.g. another thing I am working on is moving a bunch of my personal stuff into Supabase. Ideally, in the future I will be able to sync shell history between machines Atuin-style, use it as my password manager, etc., and then I could talk to my bone conduction headphones about my shell history or browsing history without having to look at it lol, and mediate what it can/can't see with schemas, row-level security, views, and Linux user access controls.)
Though another thing is just voice typing, being able to write code hands-free; I've gotten cubital and carpal tunnel before (the cubital is maybe a bit of Emacs pinky). Pretty much everything is my use case XD. Could be useful to spy on people as well XD
If you export an OPENAI_API_KEY, GPT can write code to some extent if you tell it, for example, "Computer, write python code to open a file, just the code, no comments."
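Roughly, the idea of only treating phrases that start with "Computer" as requests could be sketched like this. This is an illustration under assumptions, not the project's real code: `extract_request` and `chat_enabled` are hypothetical names, and no API call is made here.

```python
import os

WAKE_WORD = "computer"  # assumption: the phrase that marks a request


def extract_request(transcript, wake_word=WAKE_WORD):
    """Return the text after the wake word, or None if it is absent."""
    text = transcript.strip()
    if not text.lower().startswith(wake_word):
        return None
    rest = text[len(wake_word):]
    # Reject longer words that merely begin with the wake word ("Computers").
    if rest and rest[0].isalnum():
        return None
    rest = rest.lstrip(" ,.:;!")
    return rest or None


def chat_enabled(env=os.environ):
    """True when an OPENAI_API_KEY is present in the environment mapping."""
    return bool(env.get("OPENAI_API_KEY"))
```

Anything that doesn't start with the wake word returns `None` and can be typed out as plain dictation instead.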
It's using middle click paste on Linux. So if you have the wrong window focused and text goes into the wrong place, just undo it and middle click where the text was supposed to go.
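On X11, middle click pastes the "primary" selection, which is why the text can be re-pasted elsewhere afterwards. If you want to put text into that selection yourself, one way is the `xclip` command-line tool. The sketch below assumes `xclip` is installed; the helper names are hypothetical, and this is not necessarily how the project does it.

```python
import subprocess


def primary_selection_command():
    """Command line that makes xclip read stdin into the X11 primary selection."""
    return ["xclip", "-selection", "primary"]


def copy_to_primary(text, run=subprocess.run):
    """Send text to the primary selection so a middle click pastes it.

    The `run` parameter exists only so the call can be swapped out in tests.
    """
    run(primary_selection_command(), input=text.encode("utf-8"), check=True)
```

After `copy_to_primary("some text")`, a middle click in the intended window should paste it.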