OpenAI Whisper API

IF YOU FIND THIS REPOSITORY HELPFUL, PLEASE CONSIDER STARRING IT.

This microservice provides the same API as OpenAI's API.

The authors claim that Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.

Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~32x
base	74 M	`base.en`	`base`	~1 GB	~16x
small	244 M	`small.en`	`small`	~2 GB	~6x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x

The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.

Whisper's performance varies widely depending on the language. The figure below shows a WER (Word Error Rate) breakdown by languages of the Fleurs dataset using the large-v2 model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D in the paper. The smaller, the better.

Original repository

https://github.com/openai/whisper

Run

git clone https://github.com/matusstas/openai-whisper-microservice.git
cd openai-whisper-microservice
docker-compose up -d --build

Docker Hub

repository

https://hub.docker.com/r/matusstas/openai-whisper-microservice

latest

docker pull matusstas/openai-whisper-microservice:latest

1.0.0

docker pull matusstas/openai-whisper-microservice:1.0.0

Swagger API documentation

After successfully starting the image, the Swagger documentation will be available at http://localhost:80/docs

Endpoints

The API currently provides 12 endpoints, which are divided into 4 groups according to their use.

Model

GET /models-available

Description

Return a list of all available Whisper ASR models.

Response 200

[
  "tiny.en",
  "tiny",
  "base.en",
  "base",
  "small.en",
  "small",
  "medium.en",
  "medium",
  "large-v1",
  "large-v2",
  "large"
]

GET /models-downloading

Description

Return a list of all downloading Whisper ASR models.

Response 200

[
  "tiny.en"
]

GET /models-downloaded

Description

Return a list of all downloaded Whisper ASR models.

Response 200

[
  "tiny.en",
  "tiny"
]

POST /model

Description

Download a Whisper ASR model using background task.

Parameters

model_name
- required
- string (path)
- Name of the model

Response 201

{
  "detail": "Model is being downloaded"
}

Response 400

{
  "detail": "Invalid model"
}

Response 400

{
  "detail": "Model not downloaded yet"
}

Response 409

{
  "detail": "Model already exist"
}

GET /model/{model_name}

Description

Return a Whisper ASR model.

Parameters

model_name
- required
- string (path)
- Name of the model

Response 200

JSON RESPONSE IS TOO LONG TO DISPLAY

Response 400

{
  "detail": "Model not downloaded yet"
}

Response 404

{
  "detail": "Model not found"
}

DELETE /model/{model_name}

Description

Delete a downloaded Whisper ASR model.

Parameters

model_name
- required
- string (path)
- Name of the model

Response 200

{
  "detail": "Model was deleted"
}

Response 400

{
  "detail": "Model not downloaded yet"
}

Response 404

{
  "detail": "Model not found"
}

POST /model/{model_name}/language

Description

Return a sorted list of all detected languages by their score.

Parameters

model_name
- required
- string (path)
- Name of the model

Request body

file
- required
- string ($binary)
- Chosen audiofile

Response 200

{
  "en": 0.38421738147735596,
  "cy": 0.2614089846611023,
  "zh": 0.10288530588150024,
  "nn": 0.04161091521382332,
  "ko": 0.03617018833756447,
  ...
  "uz": 9.862478833611021e-9
}

Response 400

{
  "detail": "Model not multilingual"
}

Response 400

{
  "detail": "Model not downloaded yet"
}

Response 404

{
  "detail": "Model not found"
}

POST /model/{model_name}/transcript

Description

Transcribe audio with a Whisper ASR model.

Parameters

model_name
- required
- string (path)
- Name of the model

Request body

task
- required
- string
- Task: [transcribe, translate]
language_code
- required
- string
- Language code: [af, am, ar, as, az, ..., zh]
media_type
- required
- string
- Media type: [application/json, text/plain]
format
- required
- string
- Output format: [json, srt, tsv, txt, vtt]
file
- required
- string ($binary)
- Chosen audiofile

Response 200

{
  "text": " I found that nothing in life is worthwhile unless you take risks. Nothing. Nelson Mandela said, there is no passion to be found playing small and settling for a life that's less than the one you're capable of living. Now I'm sure in your experiences in school and applying to college and...",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0,
      "end": 7,
      "text": " I found that nothing in life is worthwhile unless you take risks.",
      "tokens": [
        50363,
        314,
        1043,
        326,
        2147,
        287,
        1204,
        318,
        24769,
        4556,
        345,
        1011,
        7476,
        13,
        50713
      ],
      "temperature": 0,
      "avg_logprob": -0.19224673257747166,
      "compression_ratio": 1.5508021390374331,
      "no_speech_prob": 0.013612011447548866
    },
    {
      "id": 1,
      "seek": 0,
      "start": 7,
      "end": 9,
      "text": " Nothing.",
      "tokens": [
        50713,
        10528,
        13,
        50813
      ],
      "temperature": 0,
      "avg_logprob": -0.19224673257747166,
      "compression_ratio": 1.5508021390374331,
      "no_speech_prob": 0.013612011447548866
    },
    {
      "id": 2,
      "seek": 0,
      "start": 9,
      "end": 15,
      "text": " Nelson Mandela said, there is no passion to be found playing small",
      "tokens": [
        50813,
        12996,
        40233,
        531,
        11,
        612,
        318,
        645,
        7506,
        284,
        307,
        1043,
        2712,
        1402,
        51113
      ],
      "temperature": 0,
      "avg_logprob": -0.19224673257747166,
      "compression_ratio": 1.5508021390374331,
      "no_speech_prob": 0.013612011447548866
    },
    {
      "id": 3,
      "seek": 0,
      "start": 15,
      "end": 20,
      "text": " and settling for a life that's less than the one you're capable of living.",
      "tokens": [
        51113,
        290,
        25446,
        329,
        257,
        1204,
        326,
        338,
        1342,
        621,
        262,
        530,
        345,
        821,
        6007,
        286,
        2877,
        13,
        51363
      ],
      "temperature": 0,
      "avg_logprob": -0.19224673257747166,
      "compression_ratio": 1.5508021390374331,
      "no_speech_prob": 0.013612011447548866
    },
    {
      "id": 4,
      "seek": 0,
      "start": 20,
      "end": 24,
      "text": " Now I'm sure in your experiences in school and applying to college and...",
      "tokens": [
        51363,
        2735,
        314,
        1101,
        1654,
        287,
        534,
        6461,
        287,
        1524,
        290,
        11524,
        284,
        4152,
        290,
        986,
        51563
      ],
      "temperature": 0,
      "avg_logprob": -0.19224673257747166,
      "compression_ratio": 1.5508021390374331,
      "no_speech_prob": 0.013612011447548866
    }
  ],
  "language": "en"
}

Response 400

{
  "detail": "Model not downloaded yet"
}

Response 404

{
  "detail": "Model not found"
}

Language

GET /languages

Description

Return all available languages.

Response 200

{
  "af": "afrikaans",
  "am": "amharic",
  "ar": "arabic",
  "as": "assamese",
  "az": "azerbaijani",
  ...
  "zh": "chinese"
}

GET /language/{language_code}

Description

Return an english language name.

Response 200

"italian"

Response 404

{
  "detail": "Language not found"
}

Miscellaneous

GET /cuda

Description

Check whether a GPU with CUDA support is available on the current system. Return boolean value.

Response 200

true

Root

GET /language/{language_code}

Description

Return message from container to check if it is running.

Response 200

{
  "detail": "Whisper API is running"
}

Data Persistence with Docker Volumes

In order not to lose important data, we added volumes. This is ensured by a specific command in the docker-compose.yml file.

Volume: models

All downloaded Whisper ASR models are stored in the /root/.cache/whisper folder. These are the models we work with.

Volume: db

Due to the fact that I also provide management of individual models (download or delete), I needed a simple database that would allow me to do this. So I work with TinyDB (document-oriented database written in pure Python with no external dependencies), where the important file models.json is stored in the db folder.

For example, the data in the file might looks like this:

{
   "_default":{
      "1":{
         "name":"tiny",
         "downloaded":false
      },
      "2":{
         "name":"tiny.en",
         "downloaded":true
      }
   }
}

downloaded: true = Whisper ASR model was downloaded
downloaded: false = Whisper ASR model is still downloading

Star History

Todo

GPU support (I'm working on it right now)

smeggingsmegger/openai-whisper-microservice

OpenAI Whisper API

Available models and languages

Original repository

Run

Docker Hub

Swagger API documentation

Endpoints

Model

Description

Response 200

Description

Response 200

Description

Response 200

Description

Parameters

Response 201

Response 400

Response 400

Response 409

Description

Parameters

Response 200

Response 400

Response 404

Description

Parameters

Response 200

Response 400

Response 404

Description

Parameters

Request body

Response 200

Response 400

Response 400

Response 404

Description

Parameters

Request body

Response 200

Response 400

Response 404

Language

Description

Response 200

Description

Response 200

Response 404

Miscellaneous

Description

Response 200

Root

Description

Response 200

Data Persistence with Docker Volumes

Volume: models

Volume: db

Star History

Todo