pocketsphinx takes a long time (.8 - 1.1 seconds) to send partial results on a wake word detection

Question

BeanStalka opened this issue 8 years ago · 23 comments

I am using pocketsphinx to spot one key phrase. I've reduced the dictionary to the two words that the phrase consists of.

Detection rates are excellent since I have dialed in my thresholds per documentation.

I am using Xamarin, so I have wrapped pocketsphinx in a Bound Library to gain access.

The problem is that when I do get a detection, it takes anywhere from 0.3 seconds (which I think is excellent) to 1.1 seconds (not so good).

I would like to get this time as short as possible, since I am switching from pocketsphinx to another service for the speech recognition.

I am aware that this time will never be 0, but I was hoping that maybe removing some of the files that are read in during StartListening() would help to reduce this.

Any suggestions are welcome, please see my attached implementation.

PocketSphinxWakeWordEngine.txt

nshmyrev commented 8 years ago

It is ok.

Answer 1 · 2017-05-24T13:40:00.000Z

Calling Java from Xamarin is probably not a good idea; I would try to work with pocketsphinx through the interop API instead.

Answer 2 · 2017-05-24T13:44:13.000Z

I have had no issues binding others via the Bound Libraries.

Do you have any other suggestions? Do I need to load the LM? Are there any files in the assets directory that I could avoid loading, since I am using such a small subset of pocketsphinx's capabilities?

Answer 3 · 2017-05-24T13:49:52.000Z

Do I need to load the LM?

No, it is not needed

Are there any files in the assets directory that I could avoid loading, since I am using such a small subset of pocketsphinx's capabilities?

It is unlikely to affect your response time.

Answer 4 · 2017-05-24T14:00:15.000Z

Do you have an example of someone using the interop API?

If not, thank you for your help and quick responses.

Answer 5 · 2017-05-24T14:02:45.000Z

There was a discussion here:

https://sourceforge.net/p/cmusphinx/discussion/help/thread/fb985d4d/

Answer 6 · 2017-05-24T14:09:02.000Z

Also, instead of switching quickly, I would try to just wait for the end of the utterance and then, if the keyphrase is detected, forward the whole chunk (you can get it with getRawData) to the other service. The switch will still be noticeable to users no matter how fast you switch.
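
A minimal sketch of that flow, in plain Java with no pocketsphinx dependency: remember whether the keyphrase showed up in a partial result, and at the end of the utterance hand over the whole raw chunk only if it did. The class `UtteranceGate`, its method names, and the keyphrase string are hypothetical; in a real app these calls would be driven by the recognizer's callbacks, and the samples would come from the decoder's raw data.

```java
import java.util.Optional;

/** Hypothetical helper: decides whether a finished utterance should be forwarded. */
final class UtteranceGate {
    private final String keyphrase;
    private boolean keyphraseSeen = false;

    UtteranceGate(String keyphrase) {
        this.keyphrase = keyphrase;
    }

    /** Call from the partial-result callback with the current hypothesis text (may be null). */
    void onPartialResult(String hypothesis) {
        if (hypothesis != null && hypothesis.contains(keyphrase)) {
            keyphraseSeen = true;
        }
    }

    /**
     * Call once the utterance has ended, passing the raw samples captured for it.
     * Returns the whole chunk if the keyphrase was heard, otherwise empty.
     * The flag is reset either way so the next utterance starts clean.
     */
    Optional<short[]> onUtteranceEnd(short[] rawSamples) {
        boolean forward = keyphraseSeen;
        keyphraseSeen = false;
        return forward ? Optional.of(rawSamples) : Optional.empty();
    }
}
```

The caller would forward the returned samples to the second service only when the `Optional` is present, so utterances without the keyphrase are dropped without any extra bookkeeping.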

Answer 7 · 2017-05-24T14:19:28.000Z

Wow, that is a great idea.

That way, if they say "Keyphrase, please turn on the lights", I would send that whole chunk for analysis.

I am assuming that I could call GetRawData in the ICMURecognizer.OnEndOfSpeech() hook.

I am using the SpeechRecognizer. Does that expose the Decoder so that I can call GetRawData()?

Any quick broad stroke example would be appreciated. You are the SME on this so any help you give would be amazing.

Answer 8 · 2017-05-24T14:59:02.000Z

FYI - I'm attempting to call GetRawdata, but the short[] is empty.
My updates:
1.) OnPartialResults detects the wake word and sets a flag.
2.) OnEndOfSpeech checks the flag and stops the recognizer. I am using .Stop(); should I use .Cancel()?
3.) OnResult checks the flag (_keyWordDetected) and then calls GetRawdata.

Seems like I'm missing something...
PocketSphinxWakeWordEngine.txt

Answer 9 · 2017-05-24T15:10:32.000Z

Call d.setRawdataSize(300000) in decoder setup
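
For a rough sense of scale: assuming the size is counted in 16-bit samples and the audio is 16 kHz mono (both are assumptions at this point in the thread), 300000 samples is a little under 19 seconds of audio. The helper class `RawdataSizing` below is hypothetical and only does the arithmetic; it is not part of pocketsphinx.

```java
/** Hypothetical helper for sizing a raw-audio buffer; not part of pocketsphinx. */
final class RawdataSizing {
    private RawdataSizing() {}

    /** Number of samples needed to hold the given duration of mono audio. */
    static int samplesFor(double seconds, int sampleRateHz) {
        return (int) Math.ceil(seconds * sampleRateHz);
    }

    /** Duration in seconds that a buffer of the given sample count can hold. */
    static double secondsFor(int samples, int sampleRateHz) {
        return (double) samples / sampleRateHz;
    }
}
```

Under those assumptions, `RawdataSizing.secondsFor(300000, 16000)` works out to 18.75 seconds, and `RawdataSizing.samplesFor(10, 16000)` gives the size for a 10 second buffer.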

Answer 10 · 2017-05-24T16:15:20.000Z

AWESOME

That works, I now have a full short array.

I am converting it to a byte array and will send it to Bing.

My guesses at the format of the chunk from pocketsphinx:
1.) Sample rate of 8000 Hz
2.) 16-bit PCM format

Are these assumptions correct?

Thank you again for all of your help so far! If this works, it has effectively alleviated the issue I was having with the user needing to pause after the wake word.

Answer 11 · 2017-05-24T16:34:04.000Z

You are welcome. The default sample rate is 16 kHz.
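
With 16-bit samples at 16 kHz, turning the `short[]` into bytes is mostly a question of byte order. A minimal sketch, assuming the receiving service wants raw little-endian PCM (as in a WAV data chunk); that byte order and the helper name `PcmBytes` are assumptions, so check what the service actually expects:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: packs 16-bit samples into a byte array. */
final class PcmBytes {
    private PcmBytes() {}

    /** Each sample becomes two bytes, low byte first (little-endian). */
    static byte[] toLittleEndian(short[] samples) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * 2)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        for (short sample : samples) {
            buffer.putShort(sample);
        }
        return buffer.array();
    }
}
```

The resulting array is always twice as long as the sample array; a length mismatch or swapped byte order is a common reason for a service to reject or misread the audio.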

Answer 12 · 2017-05-24T19:51:10.000Z

I am having a heck of a time getting this to work.

Bing is expecting 16-bit PCM format with a 16 kHz sample rate.

Would this be what the Decoder would supply? Everything I'm running across in the documentation says that it must be an audio format issue.

UPDATE: Turns out I was not parsing the array correctly when sending it up to Bing.

Attached, please find the updated code that is working for me.

I cannot tell you how much I appreciate all of your help.

Большое спасибо (Bal'shoye spaseeba, "thank you very much")

I owe you one.

BingSpeechToTextEngine.txt

Answer 13 · 2017-05-25T15:17:44.000Z

I have another question that I am hoping you can help me with.

If pocketsphinx is listening for a bit before I get raw results, the rawdata array gets rather large.

Is there a better way to manage this rawdata so that it only contains the audio from immediately after the wake word until the end of the utterance?

Trim the front of the array, as it were.

I was hoping there was a way to flush the array OnPartialResults when the wake word is detected. Or maybe I should just work backwards from the end of the array with timers.

I would love to hear your thoughts on this.

Answer 14 · 2017-05-25T15:18:42.000Z

End the utterance and restart it in every endOfSpeech.

Answer 15 · 2017-05-25T15:23:08.000Z

UPDATE:
I am using this and it seems to have solved the issue:

    void ICMURecognizer.OnBeginningOfSpeech()
    {
        // Restart the utterance as soon as speech begins.
        _pocketSphinxRecognizer.Decoder.EndUtt();
        _pocketSphinxRecognizer.Decoder.StartUtt();
    }

Do you foresee any issues with this approach?

Answer 16 · 2017-05-31T14:54:47.000Z

I could not find a place for the StartUtt and EndUtt that would trim up the buffer and give me the data array for what was said.

My suggestion above causes really unstable results.

Can you tell me exactly where I would need to place those calls to the methods above so that I can minimize the .SetRawdataSize(3000000) and also minimize the buffer resetting in the middle of an utterance?

Or how can I reset the buffer in OnBeginningOfSpeech?

Answer 17 · 2017-06-01T16:03:04.000Z

In onEndOfSpeech, try calling recognizer.cancel() and then recognizer.startListening().

Answer 18 · 2017-06-01T18:45:39.000Z

Unfortunately, that won't work.

If there is enough silence to fill the buffer before I say the wake word, then the buffer will be silence only.

Once EndOfSpeech is called, recognizer.cancel() and recognizer.startListening() will clear the buffer and begin listening again.

If the buffer is full before your utterance, your utterance is not captured by it.

I have begun using the timeout and the OnTimeout method, stopping and starting listening there. Unfortunately, this gives me a dead spot while the recognizer stops and restarts in OnTimeout.

Answer 19 · 2017-06-15T16:13:20.000Z

The buffer is circular; it should contain only the latest audio data.

Answer 20 · 2017-06-15T18:10:12.000Z

My goal is to capture everything in the buffer after wake word detection until end of utterance.

Given the fact that the buffer is circular, how would you suggest that I accomplish this feat?

Right now I am:

1.) Setting _junkIndexBeforeWakePhrase = _pocketSphinxRecognizer.Decoder.GetRawdata().Length; during the OnPartialResult callback when the wake phrase is detected.

2.) In OnEndOfSpeech, getting the buffer and slicing out anything before _junkIndexBeforeWakePhrase.

3.) When the buffer gets full, cancelling and then restarting the recognizer.
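
Steps 1 and 2 above amount to remembering a sample index and copying the tail of the array. A minimal sketch in plain Java; `RawdataSlice` is a hypothetical helper, and it assumes the buffer is a plain linear array that has not wrapped or been reset between the two callbacks:

```java
import java.util.Arrays;

/** Hypothetical helper: keeps only the samples recorded after a remembered index. */
final class RawdataSlice {
    private RawdataSlice() {}

    /** Returns the samples from startIndex to the end, clamping the index into range. */
    static short[] fromIndex(short[] raw, int startIndex) {
        int from = Math.max(0, Math.min(startIndex, raw.length));
        return Arrays.copyOfRange(raw, from, raw.length);
    }
}
```

Clamping the index means a stale or oversized index yields an empty array rather than an exception, which is easier to detect and skip before sending anything on.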

Answer 21 · 2017-06-18T17:00:27.000Z

I don't think you need to slice; you can just use the buffer as is. It contains roughly the last several seconds of audio, and you can feed that into the recognizer.

Answer 22 · 2017-06-19T17:24:19.000Z

Nickolay, I ran a test and do not believe that the buffer is implemented as a circular buffer. Is there a setting that I need to flip to get this behavior?

I made sure my utterance was timed with the end of the buffer. If it were circular, you would expect half the utterance to be at the end of the buffer and the other half to be at the front of the buffer as it wraps around. This is not the case. I took the entire array of raw data and saw the first half of my utterance at the end of the file, but nowhere did I see the second half of the utterance.

Please advise.
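
For comparison, this is what a ring buffer of the kind described above would look like: writes wrap around to the front, and a reader has to undo the wrap to get the samples oldest-first. It is a generic illustration in plain Java, not pocketsphinx's implementation, and `RingBuffer` is a hypothetical class.

```java
/** Generic ring buffer of 16-bit samples; illustration only, not pocketsphinx code. */
final class RingBuffer {
    private final short[] data;
    private int writePos = 0; // next slot to write
    private int count = 0;    // number of valid samples, at most data.length

    RingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        data = new short[capacity];
    }

    /** Appends samples, overwriting the oldest ones once the buffer is full. */
    void write(short[] samples) {
        for (short s : samples) {
            data[writePos] = s;
            writePos = (writePos + 1) % data.length;
            if (count < data.length) {
                count++;
            }
        }
    }

    /** Returns the stored samples oldest-first, undoing the wrap-around. */
    short[] snapshot() {
        short[] out = new short[count];
        int start = (writePos - count + data.length) % data.length;
        for (int i = 0; i < count; i++) {
            out[i] = data[(start + i) % data.length];
        }
        return out;
    }
}
```

With a buffer like this, an utterance that straddles the end of the array is still recoverable in full from `snapshot()`. A missing second half, as in the test described above, is what one would expect from a plain array that stops accepting samples once it is full.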
