alphacep/vosk-asterisk

No recognition data back from Kaldi - Asterisk 18

dima056359 opened this issue · 2 comments

Hi everyone,

Thanks for this beautiful piece of software. It works fine with the Python example script, which feeds the Kaldi server with chunks of a .wav file. I wanted to do the same thing in Asterisk by following the official README instructions, but it doesn't work as intended.
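For context, "feeding the server with chunks of a .wav file" boils down to something like the sketch below. This is not the actual example script: the function name `iter_wav_chunks` and the 0.2-second chunk duration are assumptions made for illustration, and the network send itself is left out.

```python
import wave
from typing import Iterator


def iter_wav_chunks(path: str, chunk_seconds: float = 0.2) -> Iterator[bytes]:
    """Yield raw PCM chunks from a WAV file, each about chunk_seconds long."""
    with wave.open(path, "rb") as wf:
        # Number of audio frames per chunk (hypothetical default: 0.2 s).
        frames_per_chunk = max(1, int(wf.getframerate() * chunk_seconds))
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break  # end of file
            yield data
```

A client would send each yielded chunk to the server in turn. The key point is that the audio comes from the file itself, not from a live call.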

Just FYI this is my setup:

  • Asterisk v.18.6
  • Debian 11
  • Kernel: Linux 5.10.0-21-amd64 x86_64
  • Python 3.9.2

First of all, there is a Kaldi server running inside a Docker container:

foo@bar:/etc/asterisk# docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED          STATUS          PORTS                                       NAMES
2802ee1c618c   alphacep/kaldi-ru:latest   "python3 ./asr_serve…"   36 minutes ago   Up 36 minutes   0.0.0.0:2700->2700/tcp, :::2700->2700/tcp   suspicious_lamport

Next, I cloned and built Asterisk from the v18.6 git branch (no luck with the latest v18.19, though):

foo@bar:/etc/asterisk# asterisk -rvvvvvvvvvvvvvvvvvvvvvvv
Asterisk 18.6.0, Copyright (C) 1999 - 2021, Sangoma Technologies Corporation and others.
...

So the Asterisk version seems to be okay at this point. The vosk-asterisk lib was also built and installed into Asterisk's default lib directory as follows:

foo@bar:/etc/asterisk# ls -l /usr/lib/asterisk/modules | grep -i vosk
-rw-r--r-- 1 root root  117350 Jan 31 02:32 res_speech_vosk.a
-rwxr-xr-x 1 root root     990 Jan 31 02:32 res_speech_vosk.la
-rwxr-xr-x 1 root root   73408 Jan 31 02:32 res_speech_vosk.so

The modules are listed in Asterisk's modules.conf and load properly as well:

foo*CLI> module show like vosk
Module                         Description                              Use Count  Status      Support Level
res_speech_vosk.so             Vosk Speech Engine                       0          Running              core
1 modules loaded
foo*CLI> module show like speech
Module                         Description                              Use Count  Status      Support Level
app_speech_utils.so            Dialplan Speech Applications             0          Running              core
res_speech.so                  Generic Speech Recognition API           2          Running              core
res_speech_vosk.so             Vosk Speech Engine                       0          Running              core
3 modules loaded

Here's also a piece of my extensions.conf dialplan code to run speech recognition:

[internal]
exten => 111,1,NoOp()
same => n,Answer()
same => n,SpeechCreate()
same => n,Wait(1)
same => n,SpeechBackground(/var/spool/asterisk/recording/ru-long1, 90)
same => n,Verbose(0,Result was ${SPEECH_TEXT(0)})
same => n,hangup()

Now, when the magic is supposed to happen, I dial extension 111 and observe the following:

[Jan 31 02:46:37] NOTICE[169355][C-00000003]: res_speech_vosk.c:204 vosk_recog_start: (vosk) Start recognition
[Jan 31 02:46:38] NOTICE[169355][C-00000003]: res_speech_vosk.c:164 vosk_recog_write: (vosk) Got result: '{
  "partial" : ""
}'
[Jan 31 02:46:38] NOTICE[169355][C-00000003]: res_speech_vosk.c:164 vosk_recog_write: (vosk) Got result: '{
  "partial" : ""
}'
[Jan 31 02:46:38] NOTICE[169355][C-00000003]: res_speech_vosk.c:164 vosk_recog_write: (vosk) Got result: '{
  "partial" : ""
}'
[Jan 31 02:46:38] NOTICE[169355][C-00000003]: res_speech_vosk.c:164 vosk_recog_write: (vosk) Got result: '{
  "partial" : ""
}'
[Jan 31 02:46:38] NOTICE[169355][C-00000003]: res_speech_vosk.c:164 vosk_recog_write: (vosk) Got result: '{
  "partial" : ""
}'
[Jan 31 02:46:39] NOTICE[169355][C-00000003]: res_speech_vosk.c:164 vosk_recog_write: (vosk) Got result: '{
  "partial" : ""
}'
[Jan 31 02:46:39] NOTICE[169355][C-00000003]: res_speech_vosk.c:164 vosk_recog_write: (vosk) Got result: '{
  "partial" : ""
}'
...

... it's just a bunch of empty responses coming back from the Kaldi server, as far as I understand. No real media chunks seem to be passing through, so no recognition actually happens. I decided to trace the VM's internal traffic with tcpdump and managed to catch some interesting info:

02:31:40.874029 lo    In  IP localhost.2700 > localhost.57494: Flags [.], ack 138126, win 512, options [nop,nop,TS val 4031340119 ecr 4031340077], length 0
02:31:40.954156 vethb24d8fa P   IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [P.], seq 1123:1145, ack 138126, win 2105, options [nop,nop,TS val 3874166157 ecr 1350685573], length 22
02:31:40.954171 docker0 In  IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [P.], seq 1123:1145, ack 138126, win 2105, options [nop,nop,TS val 3874166157 ecr 1350685573], length 22
02:31:40.954200 docker0 Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [.], ack 1145, win 501, options [nop,nop,TS val 1350685695 ecr 3874166157], length 0
02:31:40.954205 vethb24d8fa Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [.], ack 1145, win 501, options [nop,nop,TS val 1350685695 ecr 3874166157], length 0
02:31:40.954319 lo    In  IP localhost.2700 > localhost.57494: Flags [P.], seq 1123:1145, ack 138126, win 512, options [nop,nop,TS val 4031340200 ecr 4031340077], length 22
02:31:40.954328 lo    In  IP localhost.57494 > localhost.2700: Flags [.], ack 1145, win 512, options [nop,nop,TS val 4031340200 ecr 4031340200], length 0
02:31:41.022153 lo    In  IP localhost.57494 > localhost.2700: Flags [P.], seq 138126:141334, ack 1145, win 512, options [nop,nop,TS val 4031340267 ecr 4031340200], length 3208
02:31:41.022194 lo    In  IP localhost.2700 > localhost.57494: Flags [.], ack 141334, win 495, options [nop,nop,TS val 4031340267 ecr 4031340267], length 0
02:31:41.022281 docker0 Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [P.], seq 138126:141334, ack 1145, win 501, options [nop,nop,TS val 1350685763 ecr 3874166157], length 3208
02:31:41.022286 vethb24d8fa Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [P.], seq 138126:141334, ack 1145, win 501, options [nop,nop,TS val 1350685763 ecr 3874166157], length 3208
02:31:41.022308 vethb24d8fa P   IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [.], ack 141334, win 2155, options [nop,nop,TS val 3874166226 ecr 1350685763], length 0
02:31:41.022314 docker0 In  IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [.], ack 141334, win 2155, options [nop,nop,TS val 3874166226 ecr 1350685763], length 0
02:31:41.124809 vethb24d8fa P   IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [P.], seq 1145:1167, ack 141334, win 2155, options [nop,nop,TS val 3874166328 ecr 1350685763], length 22
02:31:41.124816 docker0 In  IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [P.], seq 1145:1167, ack 141334, win 2155, options [nop,nop,TS val 3874166328 ecr 1350685763], length 22
02:31:41.124837 docker0 Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [.], ack 1167, win 501, options [nop,nop,TS val 1350685866 ecr 3874166328], length 0
02:31:41.124840 vethb24d8fa Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [.], ack 1167, win 501, options [nop,nop,TS val 1350685866 ecr 3874166328], length 0
02:31:41.124906 lo    In  IP localhost.2700 > localhost.57494: Flags [P.], seq 1145:1167, ack 141334, win 512, options [nop,nop,TS val 4031340370 ecr 4031340267], length 22
02:31:41.124915 lo    In  IP localhost.57494 > localhost.2700: Flags [.], ack 1167, win 512, options [nop,nop,TS val 4031340370 ecr 4031340370], length 0
02:31:41.211333 lo    In  IP localhost.57494 > localhost.2700: Flags [P.], seq 141334:144542, ack 1167, win 512, options [nop,nop,TS val 4031340457 ecr 4031340370], length 3208
02:31:41.211389 lo    In  IP localhost.2700 > localhost.57494: Flags [.], ack 144542, win 495, options [nop,nop,TS val 4031340457 ecr 4031340457], length 0
02:31:41.211601 docker0 Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [P.], seq 141334:144542, ack 1167, win 501, options [nop,nop,TS val 1350685953 ecr 3874166328], length 3208
02:31:41.211612 vethb24d8fa Out IP 172.17.0.1.51092 > 172.17.0.2.2700: Flags [P.], seq 141334:144542, ack 1167, win 501, options [nop,nop,TS val 1350685953 ecr 3874166328], length 3208
02:31:41.211645 vethb24d8fa P   IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [.], ack 144542, win 2205, options [nop,nop,TS val 3874166415 ecr 1350685953], length 0
02:31:41.211655 docker0 In  IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [.], ack 144542, win 2205, options [nop,nop,TS val 3874166415 ecr 1350685953], length 0
02:31:41.216790 vethb24d8fa P   IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [P.], seq 1167:1189, ack 144542, win 2205, options [nop,nop,TS val 3874166420 ecr 1350685953], length 22
02:31:41.216803 docker0 In  IP 172.17.0.2.2700 > 172.17.0.1.51092: Flags [P.], seq 1167:1189, ack 144542, win 2205, options [nop,nop,TS val 3874166420 ecr 1350685953], length 22

Please pay attention to the very end of each line of this output: you should see the packet lengths are pretty much the same, implying the absence of any significant media payload inside the IP packets flowing between Asterisk and the Kaldi server. When I tested Kaldi with the Python script and traced the IP traffic again, the packet lengths varied all the time, so I assume there was a real media exchange in that case, as there should be.
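To make the length comparison less of an eyeball exercise, the capture can be tallied by sender and payload length. The following is a minimal sketch that assumes tcpdump's default one-line text format as shown above; the helper name `tally_lengths` is made up for illustration, and the sample lines are copied from the capture.

```python
import re
from collections import Counter

# Matches "IP <src> > <dst>: ... length <N>" in tcpdump's one-line text output.
LINE_RE = re.compile(r"IP (\S+) > (\S+): .*length (\d+)\s*$")


def tally_lengths(lines):
    """Count how often each (source, payload length) pair appears."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            src, _dst, length = m.groups()
            counts[(src, int(length))] += 1
    return counts


sample = [
    "02:31:41.022153 lo    In  IP localhost.57494 > localhost.2700: Flags [P.], seq 138126:141334, ack 1145, win 512, options [nop,nop,TS val 4031340267 ecr 4031340200], length 3208",
    "02:31:41.124906 lo    In  IP localhost.2700 > localhost.57494: Flags [P.], seq 1145:1167, ack 141334, win 512, options [nop,nop,TS val 4031340370 ecr 4031340267], length 22",
    "02:31:41.124915 lo    In  IP localhost.57494 > localhost.2700: Flags [.], ack 1167, win 512, options [nop,nop,TS val 4031340370 ecr 4031340370], length 0",
]

for (src, length), n in sorted(tally_lengths(sample).items()):
    print(f"{src:20} length={length:5} count={n}")
```

On these lines the tally separates three kinds of segments: 3208-byte pushes from the client side, 22-byte pushes from port 2700, and zero-length ACKs. That pattern suggests data of a fixed size is flowing towards the server, which by itself says nothing about whether that data contains speech.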

Last thing, this is the file I'm trying to play and recognize:

/var/spool/asterisk/recording/ru-long1.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz

Let me know if I could give you more information on my setup, what was configured and how, show you some logs etc. Thank you!

Empty partials mean the incoming audio is just silence. The SIP client is probably not capturing audio for some reason. You can dump the call to a file and listen to it.
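Besides listening to the dump, the silence hypothesis can be checked numerically. Here is a rough sketch, assuming the call was dumped to a 16-bit PCM WAV file (how the dump is produced is not covered here); the function name `wav_rms` is an illustrative choice, not part of any Vosk or Asterisk tooling.

```python
import array
import math
import sys
import wave


def wav_rms(path: str) -> float:
    """Return the RMS amplitude of a 16-bit PCM WAV file (0.0 means pure silence)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        raw = wf.readframes(wf.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

An RMS at or very near zero over the whole file would be consistent with the empty partials. What counts as "near zero" depends on line noise, so any threshold is a judgment call.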

you should see the packet lengths are pretty much the same

It sends PCM data in equal-sized chunks, so this is expected.
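The two recurring sizes in the capture (3208 and 22 bytes) fit this explanation if one assumes the traffic is WebSocket frames carrying 3200 bytes of PCM per message. Neither assumption is stated in this thread; they are inferred from the numbers, so treat the following as a plausibility check rather than a description of the module's internals. The header sizes follow the WebSocket framing rules in RFC 6455.

```python
def ws_header_len(payload_len: int, masked: bool) -> int:
    """Size in bytes of a WebSocket frame header (RFC 6455) for a given payload."""
    if payload_len < 126:
        ext = 0   # length fits in the base 2-byte header
    elif payload_len < 65536:
        ext = 2   # 16-bit extended payload length
    else:
        ext = 8   # 64-bit extended payload length
    # Client-to-server frames carry a 4-byte masking key.
    return 2 + ext + (4 if masked else 0)


SAMPLE_RATE = 8000      # Hz, matching the 8000 Hz mono file mentioned above
BYTES_PER_SAMPLE = 2    # 16-bit PCM
chunk_bytes = 3200      # ASSUMED PCM payload per message

client_segment = chunk_bytes + ws_header_len(chunk_bytes, masked=True)
reply = '{\n  "partial" : ""\n}'  # the empty partial as it appears in the log
server_segment = len(reply.encode()) + ws_header_len(len(reply), masked=False)
chunk_seconds = chunk_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)

print(client_segment)  # 3208
print(server_segment)  # 22
print(chunk_seconds)   # 0.2
```

Under these assumptions each 3208-byte segment is 0.2 seconds of audio plus 8 bytes of framing, and each 22-byte reply is one empty partial result. Constant packet sizes are therefore what streaming fixed-size PCM chunks looks like, whether the audio contains speech or silence.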

Last thing, this is the file I'm trying to play and recognize:

In the dialplan it is only played to the client; you do not send it for recognition.

Hi Nikolai,

In the dialplan it is only played to the client; you do not send it for recognition.

That was the main problem. I just got the whole idea of the demo wrong, so you are right. I thought the SpeechBackground() app would send the .wav file straight to the Kaldi server, but after digging into the Asterisk docs I realized my assumption was wrong.

The funny thing is that I was testing this demo in a silent room, with no music or people talking around me, so nothing was recognized because... there was nothing to capture, lol. Thank you for your help! Cheers.