Live Transcription with Whisper PoC in Server - Client setup

A live-transcription proof of concept built on the Whisper model (via faster-whisper), in a server (REST API) / client (Gradio UI or CLI) setup where the server can handle multiple clients.

(The server runs separately, so it can be used with any client-side code.)

Sample

Sample with a MacBook Pro (M1)

test-transcription-on-m1-mac.mov

(🔈 sound on, faster-whisper package, base model - latency was around 0.5s)

Setup

  • $ pip install -r requirements.txt
  • $ mkdir models

Run

  • Before running server.py, modify the parameters inside the file
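
The actual parameter names are defined in server.py itself; as a rough illustration only (every name and value below is hypothetical, check the file for the real ones), the tunable settings look something like:

```python
# Hypothetical illustration of the kind of parameters server.py exposes --
# the real names and defaults live inside the file itself.
MODEL_SIZE = "base"     # faster-whisper model: tiny / base / small / medium / large
DEVICE = "cpu"          # "cuda" if a GPU is available
COMPUTE_TYPE = "int8"   # quantization used by faster-whisper
STEP_SEC = 1            # seconds of new audio per transcription step
LENGTH_SEC = 4          # maximum sliding-window length in seconds
```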

Gradio interface

# Start the server (RestAPI)
python server.py

# --------------------------------

# Start the Gradio interface on localhost (HTTP)
python ui_client.py

# Start the Gradio interface with Gradio's sharing - this way it'll be served over HTTPS without the need for certs
SHARE=1 python ui_client.py

# Start the Gradio interface with your own certs
SSL_CERT_PATH=<PATH> SSL_KEY_PATH=<PATH> python ui_client.py

In the command line

python server.py
python cli_client.py

There are a few parameters in each script that you can modify.

How does it work?

This beautiful piece of ASCII art explains it:

- step = 1
- length = 4

$t$ is the chunk recorded at the current time (1 second of audio, to be precise)

------------------------------------------
1st second: [t,   0,   0,   0] --> "Hi"
2nd second: [t-1, t,   0,   0] --> "Hi I am"
3rd second: [t-2, t-1, t,   0] --> "Hi I am the one"
4th second: [t-3, t-2, t-1, t] --> "Hi I am the one and only Gabor"
5th second: [t,   0,   0,   0] --> "How" --> Here we started the process again, and the output is in a new line
6th second: [t-1, t,   0,   0] --> "How are"
etc...
------------------------------------------
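
The windowing above can be sketched in a few lines of Python (a simplified illustration of the idea, not the actual server code):

```python
LENGTH = 4  # maximum window length, in 1-second chunks (step = 1)

def sliding_windows(chunks):
    """Yield growing windows of audio chunks; restart once LENGTH is reached.

    Each yielded window is what would be sent to Whisper for transcription;
    a restart corresponds to the output moving to a new line.
    """
    window = []
    for chunk in chunks:
        window.append(chunk)
        yield list(window)          # transcribe this window
        if len(window) == LENGTH:   # window full -> start a new line
            window = []

# With 1-second chunks labelled t1..t6:
for w in sliding_windows(["t1", "t2", "t3", "t4", "t5", "t6"]):
    print(w)
# [t1], [t1,t2], [t1,t2,t3], [t1,t2,t3,t4], then restart: [t5], [t5,t6]
```

Because each window is re-transcribed from scratch, earlier words can still be revised until the window restarts.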

Improvements

  • Use a VAD on the client side, and either send the audio for transcription when we detect a longer silence (e.g. 1 sec) or if there is no silence we can fall back to the maximum length.
  • Transcribe shorter timeframes to get more instant transcriptions, and meanwhile use larger timeframes to "correct" already transcribed parts (async correction).
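
The first idea could be sketched as a simple send-decision rule on the client (a minimal energy-based stand-in for a real VAD such as webrtcvad; all names and thresholds here are made up for illustration):

```python
SILENCE_SEC = 1.0        # send after this much trailing silence
MAX_LEN_SEC = 4.0        # ...or when the buffer reaches the maximum length
ENERGY_THRESHOLD = 0.01  # frames quieter than this count as silence (tune it)

def frame_energy(frame):
    """Mean squared amplitude of one audio frame (list of float samples)."""
    return sum(s * s for s in frame) / max(len(frame), 1)

def is_silence(frame):
    return frame_energy(frame) < ENERGY_THRESHOLD

def should_send(buffered_sec, trailing_silence_sec):
    """Ship the buffered audio when a pause is detected, or fall back to
    the maximum window length when the speaker never pauses."""
    return trailing_silence_sec >= SILENCE_SEC or buffered_sec >= MAX_LEN_SEC
```

A production client would replace the energy check with a proper VAD model, but the send rule itself stays the same.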