audioserver (AS) is a state-of-the-art backend web service for transcribing (decoding) multilingual audio files in real time with automatic speech recognition (ASR) technology, via standard HTTPS requests.
Version: 1.0
Since: July 2022
by Cristian Tejedor-García.
Centre for Language and Speech Technology (CLST), Radboud University Nijmegen.
The main idea is that AS returns a word-hypothesis response in time-marked conversation (CTM) format after decoding an audio file (located on your computer or in the cloud) in real time.
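A CTM file is plain text with one word hypothesis per line, conventionally `<utterance-id> <channel> <start-time> <duration> <word> [confidence]`. As an illustration only (the field layout is assumed from the common NIST CTM convention, not taken from this repository's output), a minimal parser could look like:

```python
# Minimal CTM parser sketch. The field layout
# (utt-id, channel, start, duration, word, optional confidence)
# follows the common NIST convention; audioserver's exact
# output may differ slightly.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CtmEntry:
    utt_id: str
    channel: str
    start: float      # seconds
    duration: float   # seconds
    word: str
    confidence: Optional[float] = None

def parse_ctm(text: str) -> list:
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):  # skip blanks and comments
            continue
        parts = line.split()
        conf = float(parts[5]) if len(parts) > 5 else None
        entries.append(CtmEntry(parts[0], parts[1],
                                float(parts[2]), float(parts[3]),
                                parts[4], conf))
    return entries

def ctm_to_text(entries: list) -> str:
    """Join the word hypotheses into a plain transcript."""
    return " ".join(e.word for e in entries)
```

For example, `parse_ctm("utt1 1 0.00 0.35 hello 0.98")` yields one entry whose word is `hello` starting at 0.00 s.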
Upload&Decode an audio file from your local computer:

```mermaid
sequenceDiagram
    participant Client
    participant AS.Login
    Client->>AS.Login: login(username,password)
    AS.Login-->>Client: token
    Client->>AS.Upload: upload(audioFilePath, token)
    AS.Upload-->>Client: fileId
    Client->>AS.Decode: decode(fileId, languageCode, metadata, token)
    Note right of AS.Decode: Expect some latency<br/>depending on the audio<br/>file length and<br/>ASR model complexity
    AS.Decode-->>Client: ctm
```
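The login-and-decode flow above can be sketched from the client side. Note that the endpoint paths, JSON field names, and base URL below are illustrative assumptions, not the documented API; consult the Swagger docs for the real contract. The upload step would typically use a multipart/form-data request and is omitted here for brevity.

```python
# Hypothetical client for the login -> decode flow.
# BASE_URL, endpoint paths, and JSON field names are assumptions
# for illustration; see the Swagger API docs for the real ones.
import json
import urllib.request

BASE_URL = "https://example.com"  # placeholder server address

def bearer(token: str) -> dict:
    """Authorization header for JWT-protected requests."""
    return {"Authorization": f"Bearer {token}"}

def post_json(url: str, payload: dict, headers: dict = None) -> dict:
    """POST a JSON payload and parse the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", **(headers or {})},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def login(username: str, password: str) -> str:
    return post_json(f"{BASE_URL}/login",
                     {"username": username, "password": password})["token"]

def decode(file_id: str, language_code: str, token: str) -> str:
    # Expect some latency depending on the audio file length
    # and the ASR model complexity.
    return post_json(f"{BASE_URL}/decode",
                     {"fileId": file_id, "languageCode": language_code},
                     headers=bearer(token))["ctm"]
```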
Download&Decode an audio file from the cloud:

```mermaid
sequenceDiagram
    participant Client
    participant AS.Login
    Client->>AS.Login: login(username,password)
    AS.Login-->>Client: token
    Client->>AS.Download: download(audioFileUrl, token)
    AS.Download-->>Client: fileId
    Client->>AS.Decode: decode(fileId, languageCode, metadata, token)
    Note right of AS.Decode: Expect some latency<br/>depending on the audio<br/>file length and<br/>ASR model complexity
    AS.Decode-->>Client: ctm
```
- Easy deployment and development: Docker + UNIX/Linux + Node.js Express + MongoDB + Kaldi.
- RESTful API: client apps can send/receive information easily through a JSON REST API.
- Flexible ASR infrastructure: the language/acoustic models and the beam parameter can be chosen on the fly for decoding.
- Multilingual ASR: there is no limit on the number of ASR decoding languages, since the language can be chosen on the fly.
- Very low latency: the response time depends on the internet connection speed (client and server), the length of the audio file, and the ASR model complexity.
- 'Unlimited' parallel connections/requests: the server processes as many connections in parallel as possible (depending on the number of CPUs of your machine).
- Tracking of users' audio files: MongoDB database + one folder per user with audio files under unique IDs.
- Web logs: the web server keeps a trace of all users' interactions with the system.
- Full compatibility with any client app/device: the communication protocol can be adjusted easily.
- API documentation: Swagger (standard HTTP protocol).
- Easy communication between independent Docker containers: Unix pipelines.
- Security:
  - JSON Web Tokens (JWT) for login authentication and secure requests.
  - HTTPS for encrypted, secure client-server data transmission.
  - Audio files can be removed after the transcription is obtained (the user can select this option on the fly).
  - Strong login passwords (bcrypt).
  - Login: exceeding the maximum number of wrong attempts results in a 1-day ban (this value is configurable).
  - Fully customizable ticket system for requests: the maximum number of requests differs for regular vs. admin users. Currently 50 requests/hour for default users; admin users have no limit. This value can be set for every user individually.
  - Registration: email confirmation token.
  - Requests: required-parameter and validation rules for correct requests.
    - Audio file type: .wav, .ogg, etc. (fully customizable).
    - Audio file size limit: 5 MB (fully customizable).
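The per-user ticket system described above (50 requests/hour for default users, no limit for admins, overridable per user) can be pictured with a small sliding-window quota. The class below is an illustration of the idea only, not audioserver's actual implementation:

```python
# Illustrative per-user hourly request quota, in the spirit of
# the ticket system described above (not audioserver's code).
import time
from collections import defaultdict, deque

class HourlyQuota:
    def __init__(self, default_limit: int = 50):
        self.default_limit = default_limit
        self.limits = {}                    # per-user overrides; None = unlimited (admin)
        self.history = defaultdict(deque)   # per-user request timestamps

    def set_limit(self, user: str, limit) -> None:
        """Override the quota for one user (None means unlimited)."""
        self.limits[user] = limit

    def allow(self, user: str, now: float = None) -> bool:
        """Record a request and report whether it is within quota."""
        now = time.time() if now is None else now
        limit = self.limits.get(user, self.default_limit)
        if limit is None:                   # admin user: no limit
            return True
        window = self.history[user]
        while window and now - window[0] >= 3600:
            window.popleft()                # drop requests older than 1 hour
        if len(window) >= limit:
            return False                    # quota exhausted for this hour
        window.append(now)
        return True
```

An admin user registered with `set_limit("admin", None)` is never throttled, while a default user is refused once 50 requests fall within the last hour.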
- Download the source code of this repository into a folder (`audioserver`):

  ```shell
  cd audioserver
  ```

- Install Docker and some utilities on your Linux machine following this file:

  ```shell
  vim READMEs/README-docker_first_install.md
  ```

- Set the corresponding values in these two files:

  ```shell
  vim .mongo-variables.env
  vim .web-variables.env
  ```

- Start the docker-compose.yml:

  ```shell
  ./_startdocker.sh
  ```
The source code will be available soon.
Swagger
: https://restasr.cls.ru.nl/api-docs
Audioserverfront
: https://github.com/cristiantg/audioserverfront
If you use this software for research or work, please cite this repository, giving credit to at least Cristian and CLST:
```bibtex
@misc{cristiantg2023audioserver,
  title        = {audioserver},
  author       = {Tejedor-Garcia, Cristian},
  journal      = {GitHub repository},
  year         = {2023},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/cristiantg/audioserver}}
}
```
Cristian Tejedor-García : cristian [dot] tejedorgarcia [at] ru [dot] nl
Centre for Language and Speech Technology (CLST), Radboud University