Tool to create audio-to-text macros.
The docker image can be built using `./Dockerfile`. You can build it using the following command, run from the root directory:

```bash
sudo docker build --build-arg HF_API_KEY=<your_huggingface_api_key> . -f Dockerfile --rm -t audio-macro-creator:latest
```
First navigate to this repo on your local machine, then run the container:

```bash
sudo docker run --gpus all --name audio-macro-creator -it --rm -p 8888:8888 -p 8501:8501 -p 8000:8000 --entrypoint /bin/bash -w /audio-macro-creator -v $(pwd):/audio-macro-creator audio-macro-creator:latest
```
Inside the container:

```bash
jupyter lab --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.token=''
```

On the host machine, access this URL: `localhost:8888/<YOUR TREE HERE>`
Inside the container:

```bash
streamlit run app.py
```

On the host machine, go to `localhost:8501`.
Once the streamlit service is up and running, you can expose it publicly with ngrok:

```bash
ngrok http https://localhost:8501
```
Unit tests can be found in `/tests`. Please note that relative imports depend on the context in which you run your script. If you run `test_transcription_w_macro_app.py` as a script, Python will not be able to resolve the relative import. To avoid this, you should run your tests using the `-m` flag from the command line, like so:

```bash
python -m unittest tests/test_transcription_w_macro_app.py
```

This will correctly set the context for the relative import. Since all the unit tests are in `/tests`, you can also run them all at once:

```bash
python -m unittest discover -s tests
```
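For context, a minimal sketch of what a test module under `tests/` might look like; the imported helper `apply_macros` is a hypothetical placeholder, not the app's actual API:

```python
# tests/test_transcription_w_macro_app.py (hypothetical sketch)
import unittest

# This import resolves relative to the repo root, so it works with
# `python -m unittest ...` run from the root, but fails if this file
# is executed directly as a script.
from transcription_w_macro_app import apply_macros  # hypothetical helper

class TestMacroApplication(unittest.TestCase):
    def test_macro_is_inserted(self):
        macros = {"insert greeting": "Hello, world!"}
        self.assertEqual(apply_macros("insert greeting", macros), "Hello, world!")

if __name__ == "__main__":
    unittest.main()
```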
We need an automatic speech recognition model with low-latency inference. Try the following models (see the loading sketch after this list):
- distil-whisper/distil-large-v2
- X openai/whisper-large-v3
- X facebook/wav2vec2-base-960h
- X srujan00123/wav2vec2-large-medical-speed
- X save model to disk
- Look into real-time transcription
- Useful resources:
- X Use session states to allow saving without needing to rerun the entire script
- If a macro key phrase is longer than 4 words, just check that the first 4 words match, or throw an error if there are any macros with more than 4 words (see the matching sketch after this list).
- X Add notebooks to experiment with models
- Add dropdown for different model options
- Create separate `findings` and `conclusions` sections so that, with enough data, we can try to infer the conclusions section from the findings.
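As a starting point for trying the candidate models above, here is a minimal loading sketch using the Hugging Face `transformers` pipeline; `sample.wav` is a placeholder path:

```python
# Minimal sketch: load one candidate ASR model and transcribe a clip.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # GPU if available
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # candidate from the list above
    device=device,
)
result = asr("sample.wav")  # placeholder path to a local audio file
print(result["text"])
```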
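And a rough sketch of the 4-word key phrase rule from the list above, assuming macros are stored as a phrase-to-expansion dict; both function names are illustrative:

```python
# Rough sketch of the 4-word key phrase rule described above.
def validate_macros(macros: dict) -> None:
    """Option 2: reject any macro whose key phrase is longer than 4 words."""
    for phrase in macros:
        if len(phrase.split()) > 4:
            raise ValueError(f"Macro key phrase has more than 4 words: {phrase!r}")

def key_phrase_matches(transcript_words: list, key_phrase: str) -> bool:
    """Option 1: only check that the first 4 words of the key phrase match."""
    key = [w.lower() for w in key_phrase.split()[:4]]
    return [w.lower() for w in transcript_words[: len(key)]] == key
```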
For transcription with macro application:
- X add fuzzywuzzy, st_audiorec, and word2number to docker image
- X add dockerfile
- X add instructions for ngrok hosting
- X save the raw transcriptions along with the final (macro inserted) transcription
- X add ability to edit the final transcriptions, save the raw-final pair for training data
- X we should save the ASR model, timestamp, raw, final, and id, along with the raw audio stored under the same id (see the record sketch below)
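A possible shape for one saved record, as a sketch; the field names mirror the list above, but the JSON layout and file naming are assumptions:

```python
# Hypothetical layout for one saved raw/final training pair.
import json
import uuid
from datetime import datetime, timezone

record_id = str(uuid.uuid4())
record = {
    "id": record_id,                                # shared with the audio file
    "asr_model": "distil-whisper/distil-large-v2",  # model that produced `raw`
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "raw": "raw ASR transcription goes here",
    "final": "macro-inserted, user-edited transcription goes here",
}
with open(f"{record_id}.json", "w") as f:
    json.dump(record, f, indent=2)
# The raw audio would be saved alongside, e.g. as f"{record_id}.wav".
```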
When running the streamlit service on a VM and using the microphone of the host machine, you may get a "Component Error: Cannot read properties of null (reading 'getAudioTracks')". This is likely related to the browser's security settings and the fact that streamlit runs over http.
Here is how to solve the issue:
- Open a terminal on your Ubuntu VM.
- Run the following commands to generate a private key:

  ```bash
  openssl genrsa 2048 > host.key
  chmod 400 host.key
  ```

- Run the following command to generate a self-signed certificate. You'll be prompted for some information; you can just hit Enter to accept the defaults:

  ```bash
  openssl req -new -x509 -nodes -sha256 -days 365 -key host.key -out host.cert
  ```

- Run your Streamlit app over HTTPS with the following command:

  ```bash
  streamlit run transcription_w_macro_app.py --server.sslCertFile=host.cert --server.sslKeyFile=host.key
  ```
Now, you should be able to access your Streamlit app over HTTPS, and your browser should allow access to the microphone. Note that because the certificate is self-signed, your browser will warn you that the connection is not secure. You'll need to manually accept the risk and proceed to the site.
One big downside of using the streamlit app above is the high latency of getting a transcription from an audio recording. A big source of that latency is the time it takes the streamlit app to process the audio. In addition, we don't start transcribing until the entire recording is captured and processed. VoiceStreamAI allows us to perform near real-time audio transcription, significantly reducing the latency of the streamlit app.
Instructions to build/run the app can be found in `./VoiceStreamAI/README.md`. One thing to note about the VAD token referenced there: you will need to share your contact information with pyannote here, and then the token to use will be your huggingface token (see the instructions under TL;DR here for more details).
Explore using Open WebUI as an option for real-time speech-to-text.