FastCGI support for Kaldi. It allows Kaldi based speech recognition to be used though Apache or Nginx (or any other that support FastCGI) HTTP servers. It also contains simple HTML-based client, that allows testing Kaldi speech recognitionfrom a web page.
Apache 2.0
This guide will help you to download and build your own simple ASR web-service based on Kaldi ASR code.
Let's create a directory where all data will be downloaded and built.
mkdir ~/apiai
cd ~/apiai
You are free to choose any other name and path you wish to, but will have to keep in mind that your name differs from the name given in the guide.
Due to server code is based on Kaldi almost all prerequisites matches to Kaldi ones. Besides that a FastCGI library is required to communicate with HTTP server.
As a first step you have to clone Kaldi source tree available at https://github.com/kaldi-asr/kaldi:
git clone https://github.com/kaldi-asr/kaldi
This command will clone source tree to kaldi
directory.
To configure and build Kaldi please refer to kaldi/INSTALL
file.
For detailed information please look for Kaldi official instruction:
http://kaldi-asr.org/doc/install.html
There are some extra libraries required. You may install them using system packet manager.
In openSuSE you may run:
$ sudo zypper install FastCGI-devel
It you have Debian or Ubuntu:
$ sudo apt-get install libfcgi-dev
Return to your working directory where you put Kaldi sources
$ cd ~/apiai
and then clone server source code
$ git clone https://github.com/api-ai/asr-server asr-server
It is recommended to checkout code to the same directory where
kaldi-apiai is located to allow configure
tool to detect Kaldi
location automatically.
$ cd asr-server
Before running a make process you have to configure build scripts by running a special utility:
$ ./configure
It will check that all required libraries installed to your system and
also will look for Kaldi libraries in ../kaldi
folder. If you
have Kaldi installed somewhere else you may explicitly pass the
path via --kaldi-root option:
$ ./configure --kaldi-root=<path_to_kaldi>
If configuration process has finished successfully you may begin the building process by running make script:
$ make
When application build complete you need to download language specific data.
Return to your working directory where you put Kaldi sources
$ cd ~/apiai
Builded ASR application uses a Kaldi nnet3 models, which you can get by training a neural network with your personal data set or use a pretrained network provided by us. Currently it is only English model available at https://github.com/api-ai/api-ai-english-asr-model/releases/download/1.0/api.ai-kaldi-asr-model.zip.
$ wget https://github.com/api-ai/api-ai-english-asr-model/releases/download/1.0/api.ai-kaldi-asr-model.zip
Unzip the archive to asr-server
directory.
$ unzip api.ai-kaldi-asr-model.zip
Set the model directory as a working dir:
$ cd api.ai-kaldi-asr-model
There are several ways available to run application. The first one is
to run it as a standalone app listening on socket defined with
--fcgi-socket
option:
$ ../asr-server/fcgi-nnet3-decoder --fcgi-socket=:8000
This command runs application listening on any IP address and port 8000. You are also free to define a path Unix socket, or explicit IP address (in a A.B.C.D:PORT form).
As an alternative way you may use special spawn-fcgi utility:
$ spawn-fcgi -n -p 8000 -- ../asr-server/fcgi-nnet3-decoder
You may use any web-server which have FastCGI support: Apache, Nginx, Lighttpd etc.
openSuSE:
$ sudo zypper in apache2
Debian and Ubuntu:
$ sudo apt-get install apache2
Enable FastCGI proxy module with a2enmod
:
$ sudo a2enmod proxy_fcgi
Then you have to add to Apache2 configuration file following line:
ProxyPass "/asr" "fcgi://localhost:8000/"
If your Apache configured to include all .conf files from /etc/apache2/conf.d folder you may create separate asr_proxy.conf file with following content:
ProxyPass "/asr" "fcgi://localhost:8000/"
Alias /asr-html/ "/home/username/apiai/asr-server/asr-html/"
<Directory "/home/username/apiai/asr-server/asr-html">
Options Indexes MultiViews
AllowOverride None
Require all granted
</Directory>
Now restart Apache:
$ sudo /etc/init.d/apache2 restart
You can download latest sources from official website http://nginx.org/ and build Nginx with yourself or use your system package manager.
openSuSE:
$ sudo zypper install nginx
Debian and Ubuntu:
$ sudo apt-get install nginx
Open nginx.conf and write down the following code:
http {
server {
location /asr {
fastcgi_pass 127.0.0.1:8000;
# Disabling this option invokes immediate sending replies to client
fastcgi_buffering off;
# Disabling this option invokes immediate decoding incoming audio data
fastcgi_request_buffering off;
include fastcgi_params;
}
location /asr-html {
root /home/username/apiai/asr-server/;
index index.html;
}
}
}
This will setup Nginx to pass all requests coming to url /asr directly to ASR service listening 8000 port via FastCGI gate. For detailed information please please refer to nginx documentation (e.g. https://www.nginx.com/resources/wiki/start/topics/examples/fastcgiexample/)
Server accepts raw mono 16-bits 16 KHz PCM data. You can convert your audio using any popular encoding utilities, for instance, you can use ffmpeg:
$ ffmpeg -i audio.wav -f s16le -ar 16000 -ac 1 audio.raw
There is a simple JS implementation that allows you to recognize speech using system mic. Open in your browser:
http://localhost/asr-html/
and follow the instructions on the page.
Now, let’s recognize audio.raw
by calling web-service with curl
utility:
$ curl -H "Content-Type: application/octet-stream" --data-binary @audio.raw http://localhost/asr
On successfull recognition the command will return something like this:
{
"status":"ok",
"data":[{"confidence":0.900359,"text":"HELLO WORLD"}]
}
On error the return value will be like this:
{"status":"error","data":[{"text":"Failed to decode"}]}
There are several parameters to tune up recognition process. All parameters are expected to be passed via query string as web-form fields enumeration (e.g. ?name1=value1&name2=value2
).
Parameter | Description | Acceptable values | Default value |
---|---|---|---|
nbest | Set the number of possible returned values
|
1-10 | 1 |
endofspeech | Enable or disable end-of-speech points during recognition. If endpoint
detected all then current result have returned and the rest data would
be skipped. Also in case of interrupted recognition 2 fields would be added
to response: "interrupted" with value "endofspeech", and "time" with time point
showing the number of milliseconds have been processed.
|
true or false | true |
intermediate | Set time interval in milliseconds between intermediate results while
recognition being in progress.
|
>500 | 0 |
multipart | If enabled the result would be returned as an
HTTP multipart response with "content-type"
set to "multipart/x-mixed-replace" and each response part
has "Content-Disposition" header value equal to "form-data".
Intermediate parts named as "partial" and a final part is named as "result".
|
true or false | false |