This repository contains all the work I have done (and am still doing) to develop a web app for Speech-to-Text, based on OpenAI Whisper
- 08/12/2022: added a Notebook explaining the inner workings of match_layers.py
- 25/11/2022: clean separation between frontend and backend
- 24/11/2022: no longer any need to change the Whisper codebase to load the custom model
- You can load and use a custom-trained model, using HF Transformers (see the first sketch after this list)
- You can enable comparison of the transcription with the expected text by providing a CSV file with columns (f_name, sentence) (see the second sketch after this list)
- You can run on a GPU, and it is much faster
- supported models: medium and large (vanilla); medium for custom
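As a quick illustration of the custom-model loading mentioned above, here is a minimal sketch using the HF Transformers API. The model directory and audio file name are placeholders, not the ones used by this app; the actual loading code lives in this repo.

```python
# Minimal sketch: transcribing with a custom fine-tuned Whisper model through
# HF Transformers. The model directory and file name below are placeholders.
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_DIR = "./my-finetuned-whisper-medium"  # placeholder path

processor = WhisperProcessor.from_pretrained(MODEL_DIR)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_DIR)

# Whisper expects 16 kHz mono audio (resample first if your file differs).
audio, sample_rate = sf.read("sample.wav")

inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)
```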
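And a sketch of the expected-text comparison: the CSV format with columns (f_name, sentence) comes from the list above, while the similarity metric shown here (difflib from the standard library) is only an illustrative choice, not necessarily the one the app uses.

```python
# Minimal sketch: comparing transcriptions against expected sentences read
# from a CSV with columns (f_name, sentence). The metric is illustrative only.
import difflib
import pandas as pd

def similarity(hypothesis: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between transcription and expected text."""
    return difflib.SequenceMatcher(None, hypothesis.lower(), reference.lower()).ratio()

expected = pd.read_csv("expected.csv")  # columns: f_name, sentence

# Transcriptions produced elsewhere (e.g., by the Whisper model), keyed by file.
transcriptions = {"audio1.wav": "hello world"}  # placeholder data

for _, row in expected.iterrows():
    hyp = transcriptions.get(row["f_name"], "")
    print(f'{row["f_name"]}: {similarity(hyp, row["sentence"]):.3f}')
```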
One common use case is fine-tuning a Whisper model, for example to achieve higher accuracy on the language of a specialized domain.
The fine-tuning can be done using HF Transformers, following the approach described here.
In this case, the utility can be used to match the layers and show how to load the custom-tuned model in the Whisper codebase.
You can find some more information on this utility in the Wiki.
I have also added a Notebook that performs the matching and lets you explore, step by step, how it is done (for example, by looking at the names of the matched layers).
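To give a flavour of what the matching involves: HF Transformers and the OpenAI codebase give different names to the same Whisper layers. Below is a deliberately simplified sketch of that kind of key renaming; the rename pairs shown are an illustrative subset, and the complete, authoritative logic is in match_layers.py and the Notebook.

```python
# Simplified sketch of the key renaming behind the matching. Only a subset of
# rules is shown; the full logic lives in match_layers.py.

# (hf_fragment, openai_fragment) pairs, applied as plain substring renames.
RENAMES = [
    ("model.encoder.", "encoder."),
    ("model.decoder.", "decoder."),
    ("layers.", "blocks."),
    ("self_attn.q_proj", "attn.query"),
    ("self_attn.k_proj", "attn.key"),
    ("self_attn.v_proj", "attn.value"),
    ("self_attn.out_proj", "attn.out"),
    ("fc1", "mlp.0"),
    ("fc2", "mlp.2"),
]

def hf_to_openai(key: str) -> str:
    """Map one HF state-dict key to the corresponding OpenAI Whisper name."""
    for hf_part, oa_part in RENAMES:
        key = key.replace(hf_part, oa_part)
    return key

print(hf_to_openai("model.encoder.layers.0.self_attn.q_proj.weight"))
# -> encoder.blocks.0.attn.query.weight
```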
- Torch
- Hugging Face Transformers
- OpenAI Whisper
- Streamlit
- st-annotated-text
- soundfile
- tqdm
- pickle
- pandas
- Pillow (PIL)
- based on Python 3.10.6
- the environment can be rebuilt using the provided requirements.txt (e.g., with `pip install -r requirements.txt`)
I have tested the code, and it works fine on a VM equipped with:
- NVIDIA P100 GPU
- Ubuntu 22.04-2022.11.06
- Python 3.10
To enable the code to run on a GPU, you only need to set:
DEVICE = cuda
in the config file.
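As a rough sketch of what this setting amounts to at runtime (the variable names here are illustrative, not necessarily those used in the app):

```python
# Sketch: applying the DEVICE config value, with a CPU fallback so the same
# configuration also runs on machines without a GPU. Names are illustrative.
import torch
import whisper

DEVICE = "cuda"  # value read from the config file

device = DEVICE if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

result = model.transcribe("sample.wav")
print(result["text"])
```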
Running on a GPU is, obviously, much faster, especially with long audio files (> 60 sec.).
The table below reports the results of two tests, run with the GPU enabled and disabled:
| Test n. | Audio duration (s) | Time on CPU (s) | Time on GPU (s) |
|---|---|---|---|
| 1 | 129 | 55 | 11 |
| 2 | 255 | 110 | 19.8 |
That's about 5 times faster!