Welcome to the Voice-to-Text (Whisper) API


First, install the Python dependencies:

pip install -r requirements.txt

The server also requires the ffmpeg command-line tool, which is available from most package managers:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
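
To confirm that ffmpeg is visible on your PATH before starting the server, a small check like the following may help. This is a minimal sketch using only the Python standard library; the helper name `ffmpeg_version` is hypothetical and not part of this project.

```python
import shutil
import subprocess


def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is not on PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None
    # check=False: report whatever ffmpeg prints instead of raising on a non-zero exit
    result = subprocess.run([path, "-version"], capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version or "ffmpeg not found on PATH; install it with one of the commands above")
```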


Start Server

python server.py

You can set the host and port of the service:

python server.py --host --port 8888

Use Server

curl -F "file=@examples/en.mp3"

To use a different model, pass its name in the model_type form field:

# Use `base`
curl -X POST -F "file=@examples/en.mp3" -F "model_type=base"

# Use `base.en`
curl -X POST -F "file=@examples/en.mp3" -F "model_type=base.en"
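
The same upload can be sketched in Python with only the standard library. The form field names `file` and `model_type` come from the curl commands above; everything else is an assumption: the helper names `build_multipart` and `transcribe` are hypothetical, the endpoint URL is not given here and must be supplied by the caller, and the response is returned as raw text because its format is not documented in this section.

```python
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(file_name, file_bytes, model_type=None):
    """Encode the upload as multipart/form-data; returns (body, content_type_header)."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    parts = []
    if model_type is not None:
        # Optional text field, mirroring -F "model_type=..." in the curl examples
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="model_type"\r\n\r\n'
            f"{model_type}\r\n".encode()
        )
    # File field, mirroring -F "file=@..."
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n".encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(url, path, model_type=None):
    """POST the audio file at `path` to `url` (assumed endpoint) and return the response text."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(os.path.basename(path), f.read(), model_type)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, a call such as `transcribe(url, "examples/en.mp3", model_type="base.en")` would send the same request as the second curl command, where `url` is the address of your server's upload endpoint.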

Comparison of different models:

model_type            Required VRAM   Parameters   Relative speed
tiny.en or tiny       ~1 GB           39 M         ~32x
base.en or base       ~1 GB           74 M         ~16x
small.en or small     ~2 GB           244 M        ~6x
medium.en or medium   ~5 GB           769 M        ~2x
large                 ~10 GB          1550 M       1x

The .en models, intended for English-only applications, tend to perform better, especially tiny.en and base.en. We observed that the difference becomes less significant for small.en and medium.en.
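
As a rough illustration of how the table can guide model choice, the sketch below picks the largest model whose approximate VRAM requirement fits a given budget. The VRAM figures are the approximate values from the table; the helper name `pick_model` and the selection rule itself are assumptions for illustration, not part of the server.

```python
# (model family, approximate required VRAM in GB), ordered smallest to largest,
# taken from the comparison table above
MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(vram_gb, english_only=False):
    """Return the largest model_type whose approximate VRAM need fits in vram_gb."""
    fitting = [name for name, need in MODELS if need <= vram_gb]
    if not fitting:
        raise ValueError(f"no model fits in {vram_gb} GB of VRAM")
    name = fitting[-1]
    # The table lists no .en variant for large, so it is returned unchanged
    if english_only and name != "large":
        name += ".en"
    return name


if __name__ == "__main__":
    print(pick_model(4, english_only=True))
```

The returned string can be passed as the model_type form field. Because the VRAM figures are approximate, leave some headroom when choosing a budget.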