cmusphinx/pocketsphinx-android

pocketsphinx takes a long time (.8 - 1.1 seconds) to send partial results on a wake word detection

BeanStalka opened this issue · 23 comments

I am using pocketsphinx to spot one key phrase. I've reduced the dictionary to the two words that the phrase consists of.

Detection rates are excellent since I have dialed in my thresholds per documentation.

I am using Xamarin, so I have wrapped pocketsphinx in a Bound Library to gain access.

The problem is that when I do get a detection, it takes anywhere from 0.3 seconds (which I think is excellent) to 1.1 seconds (not so good).

I would like to get this time as short as possible, since I am switching from pocketsphinx to another service for the speech recognition.

I am aware that this time will never be 0, but I was hoping that maybe removing some of the files that are read in during StartListening() would help to reduce this.

Any suggestions are welcome, please see my attached implementation.

PocketSphinxWakeWordEngine.txt

Calling Java from Xamarin is probably not a good idea; I would try to work with pocketsphinx through the interop API instead.

I have had no issues binding others via the Bound Libraries.

Do you have any other suggestions? Do I need to load the LM? Are there any files in the assets directory that I could avoid loading, since I am using such a small subset of pocketsphinx's capabilities?

Do I need to load the LM?

No, it is not needed.

Are there any files in the assets directory that I could avoid loading, since I am using such a small subset of pocketsphinx's capabilities?

It is unlikely to affect your response time.

Do you have an example of someone using the interop API?

If not, thank you for your help and quick responses.

Also, instead of switching quickly, I would just wait for the end of the utterance and then, if the keyphrase is detected, forward the whole chunk (you can get it with getRawData) to the other service. The switch will still be noticeable to users no matter how fast you switch.

Wow, that is a great idea.

That way, if they say "Keyphrase, please turn on the lights", I would send that whole chunk for analysis.

I am assuming that I could call GetRawData in the ICMURecognizer.OnEndOfSpeech() hook.

I am using the SpeechRecognizer. Does that expose the Decoder so that I can call GetRawData()?

Any quick broad stroke example would be appreciated. You are the SME on this so any help you give would be amazing.

FYI - I'm attempting to call GetRawdata, but the short[] is empty.
My updates:
1.) OnPartialResults detects the wake word and sets a flag.
2.) OnEndOfSpeech checks the flag and stops the recognizer. I am using .Stop(); should I use .Cancel()?
3.) OnResult checks the flag (_keyWordDetected) and then calls GetRawdata.

Seems like I'm missing something...
PocketSphinxWakeWordEngine.txt

Call d.setRawdataSize(300000) in decoder setup
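For a rough sense of how much audio a buffer of that size can hold, here is a minimal Java sketch of the arithmetic. It assumes the size passed to setRawdataSize is counted in 16-bit mono samples (an assumption worth checking against the pocketsphinx source) and takes the sample rate as a parameter. `RawdataBudget` and its methods are hypothetical helpers, not part of pocketsphinx.

```java
// Hypothetical helper for buffer-size arithmetic; not part of the pocketsphinx API.
final class RawdataBudget {
    /** Seconds of mono audio that fit in a buffer of sampleCount samples. */
    static double secondsOfAudio(int sampleCount, int sampleRateHz) {
        return (double) sampleCount / sampleRateHz;
    }

    /** Number of samples needed to hold the given duration, rounded up. */
    static int samplesFor(double seconds, int sampleRateHz) {
        return (int) Math.ceil(seconds * sampleRateHz);
    }

    public static void main(String[] args) {
        // Assumes the size is in samples and the audio is 16 kHz mono.
        System.out.println(secondsOfAudio(300_000, 16_000) + " s");
        System.out.println(samplesFor(10.0, 16_000) + " samples for 10 s");
    }
}
```

Under those assumptions, 300000 samples at 16 kHz works out to 18.75 seconds of audio, so the value can be scaled to the longest utterance you expect.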

AWESOME

That works, I now have a full short array.

I am converting it to a byte array and will send it to Bing.

My guesses at the format of the chunk from pocketsphinx:
1.) Sample rate of 8000 Hz
2.) 16-bit PCM format

Are these assumptions correct?

Thank you again for all of your help so far! If this works, it effectively alleviates the issue I was having with the user needing to pause after the wake word.

You are welcome. The default sample rate is 16 kHz.
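Since the raw data comes back as a short array and the receiving service wants bytes, the conversion could look roughly like the Java sketch below. It assumes the service expects little-endian 16-bit PCM (the byte order used in WAV files); `PcmBytes` is a hypothetical helper, not part of pocketsphinx or of any Bing SDK.

```java
// Hypothetical helper; assumes the consumer wants little-endian 16-bit PCM.
final class PcmBytes {
    /** Packs each 16-bit sample into two bytes, low byte first. */
    static byte[] toLittleEndianBytes(short[] samples) {
        byte[] out = new byte[samples.length * 2];
        for (int i = 0; i < samples.length; i++) {
            out[2 * i] = (byte) (samples[i] & 0xFF);            // low byte
            out[2 * i + 1] = (byte) ((samples[i] >> 8) & 0xFF); // high byte
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = toLittleEndianBytes(new short[] {1, -2, 0x1234});
        System.out.println(bytes.length + " bytes");
    }
}
```

Each sample becomes exactly two bytes, so the output is always twice the length of the input. Getting the byte order or the sample width wrong typically shows up as noise or as a rejected request on the service side.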

I am having a heck of a time getting this to work.

Bing is expecting 16-bit PCM format with a 16 kHz sample rate.

Would this be what the Decoder supplies? Everything I'm running across as far as documentation goes says that it must be an audio format issue.

UPDATE: Turns out I was not parsing the array correctly when sending it up to Bing.

Please find attached the updated code that is working for me.

I cannot tell you how much I appreciate all of your help.

Большое спасибо (thank you very much).

I owe you one.

BingSpeechToTextEngine.txt

I have another question that I am hoping you can help me with.

If pocketsphinx is listening for a while before I get raw results, the rawdata array gets rather large.

Is there a better way to manage this rawdata so that it only contains the audio from immediately after the wake word until the end of the utterance?

Trim the front of the array as it were.

I was hoping there was a way to flush the array OnPartialResults when the wake word is detected. Or maybe I should just work backwards from the end of the array with timers.

I would love to hear your thoughts on this.

End the utterance and restart it in every endOfSpeech.

UPDATE:
I am using this and it seems to have solved the issue:
void ICMURecognizer.OnBeginningOfSpeech()
{
    _pocketSphinxRecognizer.Decoder.EndUtt();
    _pocketSphinxRecognizer.Decoder.StartUtt();
}
Do you foresee any issues with this approach?

It is ok.

I could not find a place for the StartUtt and EndUtt that would trim up the buffer and give me the data array for what was said.

My suggestion above causes really unstable results.

Can you tell me exactly where I would need to place the calls to the methods above so that I can minimize the .SetRawdataSize(3000000) value and also minimize the buffer resetting in the middle of an utterance?

Or how can I reset the buffer in OnBeginningOfSpeech?

In onEndOfSpeech, try calling recognizer.cancel() and recognizer.startListening().

Unfortunately, that won't work.

If there is enough silence to fill the buffer before I say the wake word, then the buffer will be silence only.

Once EndOfSpeech is called, recognizer.cancel() and recognizer.startListening() will clear the buffer and begin listening again.

If the buffer is full before your utterance, your utterance is not captured by it.

I have begun using the timeout and the OnTimeout method, stopping and starting listening. Unfortunately, this gives me a dead spot when the recognizer stops and starts in OnTimeout.

The buffer is circular; it should contain only the latest audio data.
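To illustrate what "circular" means here, the following is a minimal, generic Java sketch of a ring buffer that keeps only the most recent samples. It is not pocketsphinx's internal implementation, and the class name `RingBuffer` and its methods are hypothetical; it only shows why old audio drops out once the capacity is reached.

```java
// Generic ring buffer illustration; NOT pocketsphinx's internal code.
final class RingBuffer {
    private final short[] data;
    private int next;   // index where the next sample will be written
    private int count;  // number of valid samples, at most data.length

    RingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        data = new short[capacity];
    }

    /** Appends samples, overwriting the oldest ones once the buffer is full. */
    void write(short[] samples) {
        for (short s : samples) {
            data[next] = s;
            next = (next + 1) % data.length;
            if (count < data.length) {
                count++;
            }
        }
    }

    /** Returns the retained samples in order, oldest first. */
    short[] snapshot() {
        short[] out = new short[count];
        int start = (next - count + data.length) % data.length;
        for (int i = 0; i < count; i++) {
            out[i] = data[(start + i) % data.length];
        }
        return out;
    }
}
```

With a capacity of 4, writing 1, 2, 3 and then 4, 5, 6 leaves 3, 4, 5, 6 in the buffer: the two oldest samples are overwritten and the rest come back in their original order.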

My goal is to capture everything in the buffer after wake word detection until end of utterance.

Given the fact that the buffer is circular, how would you suggest that I accomplish this feat?

Right now I am:

1.) Setting _junkIndexBeforeWakePhrase = _pocketSphinxRecognizer.Decoder.GetRawdata().Length; during the OnPartialResult callback when the wake phrase is detected.

2.) In OnEndOfSpeech, getting the buffer and slicing out anything before _junkIndexBeforeWakePhrase.

3.) When the buffer gets full, cancelling and then restarting the recognizer.
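The slicing in step 2 can be sketched in Java as below. This is only an illustration of the index arithmetic: it assumes the buffer has not wrapped or been reset between the wake-word detection and the end of speech, so that the recorded index still points at the same audio. `WakeWordSlice` and `audioAfter` are hypothetical names.

```java
// Hypothetical helper; assumes the recorded index is still valid for this buffer.
final class WakeWordSlice {
    /**
     * Returns the samples from junkIndex to the end of rawdata.
     * The index is clamped so an out-of-range value cannot throw.
     */
    static short[] audioAfter(short[] rawdata, int junkIndex) {
        int from = Math.max(0, Math.min(junkIndex, rawdata.length));
        return java.util.Arrays.copyOfRange(rawdata, from, rawdata.length);
    }

    public static void main(String[] args) {
        short[] tail = audioAfter(new short[] {10, 20, 30, 40, 50}, 2);
        System.out.println(tail.length + " samples kept");
    }
}
```

Clamping means an index recorded before a buffer reset yields an empty or full slice rather than an exception, but the audio would still be wrong in that case, so the reset in step 3 needs to clear the stored index as well.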

I don't think you need to slice. You can just use the buffer as is: it contains something like the last several seconds of audio, and you can feed that into the recognizer.