audioserver

audioserver (AS) is a state-of-the-art backend web service that transcribes (decodes) audio files in multiple languages with automatic speech recognition (ASR) technology, in real time, via standard HTTPS requests.

Version: 1.0

Since: July 2022

by Cristian Tejedor-García.

Centre for Language and Speech Technology (CLST), Radboud University Nijmegen.

How it works

The main idea is that AS decodes an audio file (stored on your computer or in the cloud) in real time and returns a word hypothesis response in time-marked conversation (CTM) format.
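To make the response format concrete, here is a minimal sketch of reading CTM text on the client side. It assumes the conventional CTM column layout (file ID, channel, start time, duration, word, optional confidence) and that the CTM arrives as plain text; whether AS wraps it in JSON is not stated here. The sample lines and the names `CtmWord` and `parse_ctm` are illustrative, not part of AS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CtmWord:
    """One recognised word from a CTM line."""
    file_id: str
    channel: str
    start: float      # seconds from the start of the audio
    duration: float   # seconds
    word: str
    confidence: Optional[float] = None  # optional sixth column


def parse_ctm(text: str) -> List[CtmWord]:
    """Parse CTM text into word records, skipping blank and ';;' comment lines."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"malformed CTM line: {line!r}")
        conf = float(fields[5]) if len(fields) > 5 else None
        words.append(CtmWord(fields[0], fields[1], float(fields[2]),
                             float(fields[3]), fields[4], conf))
    return words


# Hypothetical sample; real output depends on the audio and the ASR model.
sample = """\
audio01 1 0.00 0.42 hello 0.98
audio01 1 0.42 0.35 world 0.91
"""
words = parse_ctm(sample)
transcript = " ".join(w.word for w in words)
print(transcript)
```

Joining the `word` column gives the plain transcript, while `start` and `duration` can be used to align each word with the audio.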

Option A:

Upload and decode an audio file from your local computer:

sequenceDiagram
	participant Client
	participant AS.Login
	Client->>AS.Login: login(username,password)
	AS.Login-->>Client: token
	Client->>AS.Upload: upload(audioFilePath, token)
	AS.Upload-->>Client: fileId
	Client->>AS.Decode: decode(fileId, languageCode, metadata, token)
	Note right of AS.Decode: Expect some latency<br/>depending on the audio<br/>file length and<br/>ASR model complexity
	AS.Decode-->>Client: ctm

Option B:

Download and decode an audio file from the cloud:

sequenceDiagram
	participant Client
	participant AS.Login
	Client->>AS.Login: login(username,password)
	AS.Login-->>Client: token
	Client->>AS.Download: download(audioFileUrl, token)
	AS.Download-->>Client: fileId
	Client->>AS.Decode: decode(fileId, languageCode, metadata, token)
	Note right of AS.Decode: Expect some latency<br/>depending on the audio<br/>file length and<br/>ASR model complexity
	AS.Decode-->>Client: ctm
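The two flows above differ only in their second step, which a client could capture roughly as follows. This is a sketch under assumptions: the operation and field names (`login`, `upload`, `download`, `decode`, `token`, `fileId`, `ctm`) are taken from the diagrams, but the real URL paths, request encoding, and response shapes are defined in the Swagger documentation and may differ. The `fake_transport` function stands in for the server so the example runs offline; a real transport would issue HTTPS requests instead. The language code `nl` is only an example value.

```python
from typing import Callable, Dict, Optional

# A transport performs one API call: (operation, payload, token) -> response dict.
Transport = Callable[[str, Dict, Optional[str]], Dict]


def transcribe(call: Transport, username: str, password: str,
               language_code: str, *, audio_path: Optional[str] = None,
               audio_url: Optional[str] = None,
               metadata: Optional[Dict] = None) -> str:
    """Run login -> upload/download -> decode and return the CTM text."""
    if (audio_path is None) == (audio_url is None):
        raise ValueError("pass exactly one of audio_path or audio_url")

    # Step 1: authenticate once; the token accompanies every later call.
    token = call("login", {"username": username, "password": password}, None)["token"]

    # Step 2: register the audio with the server (Option A or Option B).
    if audio_path is not None:
        file_id = call("upload", {"audioFilePath": audio_path}, token)["fileId"]
    else:
        file_id = call("download", {"audioFileUrl": audio_url}, token)["fileId"]

    # Step 3: decode; this call may take a while for long files.
    payload = {"fileId": file_id, "languageCode": language_code,
               "metadata": metadata or {}}
    return call("decode", payload, token)["ctm"]


def fake_transport(operation: str, payload: Dict, token: Optional[str]) -> Dict:
    """In-memory stand-in for the server, for demonstration only."""
    if operation == "login":
        return {"token": "demo-token"}
    if token != "demo-token":
        raise PermissionError("missing or invalid token")
    if operation in ("upload", "download"):
        return {"fileId": "file-001"}
    if operation == "decode":
        return {"ctm": f"{payload['fileId']} 1 0.00 0.42 hello"}
    raise ValueError(f"unknown operation: {operation}")


ctm_a = transcribe(fake_transport, "alice", "secret", "nl", audio_path="speech.wav")
ctm_b = transcribe(fake_transport, "alice", "secret", "nl",
                   audio_url="https://example.org/speech.wav")
print(ctm_a)
```

Passing the transport as a parameter keeps the three-step sequence separate from the HTTP details, so the same function can be exercised against a stub or against the live service.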

Features

  1. Easy deployment and development: Docker + UNIX/Linux + Node.js Express + MongoDB + Kaldi.

  2. RESTful API: Client apps can send/receive information easily through a JSON REST API.

  3. Flexible ASR infrastructure: The language/acoustic models and the beam parameter can be chosen on the fly for decoding.

  4. Multilingual ASR: There is no limit on the number of ASR decoding languages, since the language to use is chosen on the fly.

  5. Very low latency: The response time depends on the internet connection speed (client and server), the length of the audio file, and the ASR model complexity.

  6. 'Unlimited' parallel connections/requests: The server processes connections in parallel; the practical limit depends on the number of CPUs of your machine.

  7. Tracking of users' audio files: MongoDB database + folder (per user) with audio files with unique IDs.

  8. Web logs: The web server keeps a log of all user interactions with the system.

  9. Full compatibility with any client-app/device: The communication protocol can be adjusted easily.

  10. API documentation: Swagger (standard HTTP protocol).

  11. Easy communication between independent Docker containers: Unix pipelines.

  12. Security:

    1. JSON Web Tokens (JWT) for login authentication and secure requests.
    2. HTTPS for encrypted data transmission between client and server.
    3. Audio files can be removed after the transcription is obtained (the user can select this option on the fly).
    4. Strong login passwords (hashed with bcrypt).
    5. Login: Exceeding the maximum number of wrong attempts results in a 1-day ban (this value is configurable).
    6. Fully customizable ticket system for requests: The maximum number of requests differs for regular and admin users. Currently, default users get 50 requests/hour and admin users have no limit. This value can be set for every user individually.
    7. Register: Email confirmation token.
    8. Requests: Required-parameter and validation rules ensure that requests are well formed.
    9. Type of the audio file: .wav, .ogg, etc. (fully customizable).
    10. Size limit of the audio file: 5 MB (fully customizable).
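Two of the checks listed above (the audio file type and size limits, and the per-user request quota) can be illustrated with a short sketch. It is not the server's implementation, which runs on Node.js and is not shown here: the names `check_upload` and `HourlyQuota` are hypothetical, the sliding one-hour window is an assumption about how requests are counted, and 5 MB is taken to mean 5 × 1024 × 1024 bytes.

```python
import os
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

ALLOWED_EXTENSIONS = {".wav", ".ogg"}      # example values from the list above
MAX_UPLOAD_BYTES = 5 * 1024 * 1024         # 5 MB, assuming 1 MB = 1024 * 1024 bytes


def check_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None if the upload is acceptable."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"unsupported file type: {ext or '(none)'}"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "file too large"
    return None


class HourlyQuota:
    """Sliding-window request counter with an optional per-user limit."""

    def __init__(self, default_limit: int = 50, window_seconds: float = 3600.0):
        self.default_limit = default_limit
        self.window = window_seconds
        self.limits: Dict[str, Optional[int]] = {}   # None means unlimited (admin)
        self.history: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user: str, now: Optional[float] = None) -> bool:
        """Report whether a request fits the user's quota, recording it if so."""
        now = time.time() if now is None else now
        limit = self.limits.get(user, self.default_limit)
        if limit is None:
            return True
        stamps = self.history[user]
        # Drop requests that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= limit:
            return False
        stamps.append(now)
        return True


quota = HourlyQuota()
quota.limits["admin"] = None                 # per-user override: no limit
results = [quota.allow("bob", now=0.0) for _ in range(51)]
print(results.count(True))                   # 50 accepted, the 51st is refused
print(quota.allow("bob", now=3600.0))        # True: the earlier requests expired
print(check_upload("talk.mp3", 1000))        # unsupported file type: .mp3
```

Keeping the limit in a per-user mapping mirrors the point that the value can be set for every user individually, with admin users mapped to no limit.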

Installation

  1. Download the source code of this repository into a folder (audioserver) and enter it:

cd audioserver

  2. Install Docker and some utilities on your Linux machine by following this file:

vim READMEs/README-docker_first_install.md

  3. Set the corresponding values in these two files:

vim .mongo-variables.env

vim .web-variables.env

  4. Start the services defined in docker-compose.yml:

./_startdocker.sh

Source code

The source code will be available soon.

API documentation

Swagger : https://restasr.cls.ru.nl/api-docs

Frontend

Audioserverfront : https://github.com/cristiantg/audioserverfront

How to cite this work

If you use this software for research or work, please cite this repository, giving credit to at least Cristian and CLST.

@misc{cristiantg2023audioserver,
  title={audioserver},
  author={Tejedor-Garcia, Cristian},
  journal={GitHub repository},
  year={2023},
  publisher={GitHub},
  howpublished = {\url{https://github.com/cristiantg/audioserver}}
}

Contact

Cristian Tejedor-García : cristian [dot] tejedorgarcia [at] ru [dot] nl

Centre for Language and Speech Technology (CLST), Radboud University