This repository contains all the work I have done (and am still doing) to develop a web app for Speech-to-Text, based on OpenAI Whisper
- 08/12/2022: added a Notebook explaining the inner workings of match_layers.py
- 25/11/2022: clean separation between frontend and backend
- 24/11/2022: no longer any need to change the Whisper codebase to load the custom model
- You can load and use a custom-trained model, using HF Transformers (see the first sketch after this list)
- You can enable comparison of the transcription with the expected text by providing a CSV file with columns (f_name, sentence) (see the second sketch after this list)
- You can run on a GPU, and it is much faster
- supported models: medium and large (vanilla); medium for custom
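As a quick illustration of the custom-model loading mentioned above, here is a minimal sketch using the HF Transformers API. The model directory and audio file name are placeholders, not the ones used by this app; the actual loading code lives in this repo.

```python
# Minimal sketch: transcribing with a custom fine-tuned Whisper model through
# HF Transformers. The model directory and file name below are placeholders.
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_DIR = "./my-finetuned-whisper-medium"  # placeholder path

processor = WhisperProcessor.from_pretrained(MODEL_DIR)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_DIR)

# Whisper expects 16 kHz mono audio (resample first if your file differs).
audio, sample_rate = sf.read("sample.wav")

inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)
```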
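And a sketch of the expected-text comparison: the CSV format with columns (f_name, sentence) comes from the list above, while the similarity metric shown here (difflib from the standard library) is only an illustrative choice, not necessarily the one the app uses.

```python
# Minimal sketch: comparing transcriptions against expected sentences read
# from a CSV with columns (f_name, sentence). The metric is illustrative only.
import difflib
import pandas as pd

def similarity(hypothesis: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between transcription and expected text."""
    return difflib.SequenceMatcher(None, hypothesis.lower(), reference.lower()).ratio()

expected = pd.read_csv("expected.csv")  # columns: f_name, sentence

# Transcriptions produced elsewhere (e.g., by the Whisper model), keyed by file.
transcriptions = {"audio1.wav": "hello world"}  # placeholder data

for _, row in expected.iterrows():
    hyp = transcriptions.get(row["f_name"], "")
    print(f'{row["f_name"]}: {similarity(hyp, row["sentence"]):.3f}')
```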
One common use case is fine-tuning a Whisper model, for example to achieve higher accuracy on the language of a specialized domain.
The fine-tuning can be done using HF Transformers, following the approach described here.
In this case, the utility can be used to match the layers and show how to load the custom-tuned model in the Whisper codebase.
You can find some more information on this utility in the Wiki.
I have also added a Notebook that performs the matching and lets you explore, step by step, how it is done (for example, by looking at the names of the matched layers).
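To give a flavour of what the matching involves: HF Transformers and the OpenAI codebase give different names to the same Whisper layers. Below is a deliberately simplified sketch of that kind of key renaming; the rename pairs shown are an illustrative subset, and the complete, authoritative logic is in match_layers.py and the Notebook.

```python
# Simplified sketch of the key renaming behind the matching. Only a subset of
# rules is shown; the full logic lives in match_layers.py.

# (hf_fragment, openai_fragment) pairs, applied as plain substring renames.
RENAMES = [
    ("model.encoder.", "encoder."),
    ("model.decoder.", "decoder."),
    ("layers.", "blocks."),
    ("self_attn.q_proj", "attn.query"),
    ("self_attn.k_proj", "attn.key"),
    ("self_attn.v_proj", "attn.value"),
    ("self_attn.out_proj", "attn.out"),
    ("fc1", "mlp.0"),
    ("fc2", "mlp.2"),
]

def hf_to_openai(key: str) -> str:
    """Map one HF state-dict key to the corresponding OpenAI Whisper name."""
    for hf_part, oa_part in RENAMES:
        key = key.replace(hf_part, oa_part)
    return key

print(hf_to_openai("model.encoder.layers.0.self_attn.q_proj.weight"))
# -> encoder.blocks.0.attn.query.weight
```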
- Torch
- Hugging Face Transformers
- OpenAI Whisper
- Streamlit
- st-annotated-text
- soundfile
- tqdm
- pickle
- pandas
- Pillow (PIL)
- based on Python 3.10.6
- the environment can be rebuilt using the provided requirements.txt (e.g., with `pip install -r requirements.txt`)
I have tested the code, and it works fine on a VM equipped with:
- NVIDIA P100 GPU
- Ubuntu 22.04-2022.11.06
- Python 3.10
To enable the code to run on a GPU, you only need to set:
DEVICE = cuda
in the config file.
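As a rough sketch of what this setting amounts to at runtime (the variable names here are illustrative, not necessarily those used in the app):

```python
# Sketch: applying the DEVICE config value, with a CPU fallback so the same
# configuration also runs on machines without a GPU. Names are illustrative.
import torch
import whisper

DEVICE = "cuda"  # value read from the config file

device = DEVICE if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

result = model.transcribe("sample.wav")
print(result["text"])
```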
Running on a GPU is, obviously, much faster, especially with long audio files (> 60 sec.).
The table below reports the results of two tests, run with the GPU enabled and disabled:
| Test n. | Audio duration (s) | Time on CPU (s) | Time on GPU (s) |
|---|---|---|---|
| 1 | 129 | 55 | 11 |
| 2 | 255 | 110 | 19.8 |
That's about 5 times faster!