MycroftAI/mycroft-precise

Model predictions vary significantly depending on position of wakeword in audio

dscripka opened this issue · 4 comments

Describe the bug
When using the Python bindings for Precise, I've noticed that the model predictions can vary substantially depending on where in the input audio the wake word is located. For example, the plot below shows the default "hey mycroft" model score for two repetitions of the same audio clip, where the only difference is that the second clip has one additional frame (1024 samples) of zero-padding compared to the first clip:

image

I'm currently doing some evaluation of Precise compared to other wakeword solutions, and this behavior is making it difficult to accurately assess performance as the length and padding of the test clips can cause significant differences in false-positive and false-negative metrics due to this behavior.

Is this behavior expected? If so, is there a recommended way to evaluate the model to minimize such effects?

To Reproduce
The following code should reproduce the plot above, using the attached audio file below and the model versions referenced in the code:

test_clip.zip

import scipy.io.wavfile
import numpy as np
import matplotlib.pyplot as plt
from precise_runner import PreciseEngine

# Set chunk size
chunk_size = 1024

# Load clip
sr, dat = scipy.io.wavfile.read("path/to/attached/wav/file")

# Create versions of clip
version1 = np.concatenate((
    np.zeros(chunk_size * 50, dtype=np.int16),
    dat,
))

version2 = np.concatenate((
    np.zeros(chunk_size * 51, dtype=np.int16),  # one more chunk of zeros than version1
    dat,
))

ps = []
for clip in [version1, version2]:
    # Load Precise model for each clip
    P = PreciseEngine(
        './precise-engine_0.3.0_x86_64/precise-engine/precise-engine',
        "models/hey-mycroft_C1_E6000_B5000_D0.2_R20_S0.8.pb",
        chunk_size=chunk_size*2 # in bytes, not samples
    )
    P.start()
    
    for i in range(0, clip.shape[0] - chunk_size, chunk_size):
        prediction = P.get_prediction(clip[i:i + chunk_size].tobytes())
        # Skip the first few predictions to avoid model initialization behavior
        if i >= chunk_size * 5:
            ps.append(prediction)
        
    P.stop()
    
plt.plot(ps)
plt.xlabel("Chunk Index")
plt.ylabel("Model Score")

Expected behavior
Precise should have very similar scores for otherwise identical audio that just occurs at a different position in the audio stream.

Precise is meant to operate on a continuous stream of audio. For this reason, it is only trained to output a high score for the frames immediately after the wake word. If you want to test a model against an entire audio sample, you should take the maximum value of all outputs.
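To illustrate the suggested evaluation, here is a minimal sketch of scoring a whole clip by the maximum per-chunk output. The `toy_predict` function is a hypothetical stand-in for `PreciseEngine.get_prediction`, included only so the snippet runs on its own:

```python
import numpy as np

def clip_score(predict, clip, chunk_size=1024):
    # Feed the clip chunk by chunk and keep the maximum model output
    scores = [predict(clip[i:i + chunk_size].tobytes())
              for i in range(0, len(clip) - chunk_size, chunk_size)]
    return max(scores)

def toy_predict(chunk_bytes):
    # Hypothetical predictor: score rises with mean chunk energy
    samples = np.frombuffer(chunk_bytes, dtype=np.int16)
    return float(np.abs(samples).mean()) / 32768.0

clip = np.zeros(1024 * 5, dtype=np.int16)
clip[2048:3072] = 16384  # a loud "event" one chunk long, mid-clip
print(clip_score(toy_predict, clip))  # 0.5
```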

Let me know if that makes sense.

@MatthewScholefield, yes, that's what the plot is showing: the model score for every frame of the two input audio samples. So for the first clip the maximum score is around ~0.23, while for the second clip (where the only difference is a single extra frame of zero-padding) the maximum score is only around ~0.06.

It might be clearer if I make the two plots separate. So this is the model's score for all of the frames of the first input clip:

plot2

And this is the model score for the second input clip:

plot1

So even if I use the maximum of all the outputs, I get a very different value for an otherwise identical audio clip.

Oh, I see, thanks for clarifying. This is definitely not intended. Just for some clarity on how it works: the engine feeds the audio features (MFCCs) from the last buffer_t seconds through the network to independently produce each output. You can see the value of buffer_t by looking in the .params file. Overall, there are two hypotheses I can come up with as to why this could occur:

  1. There's abnormal audio in the start portion of the audio buffer. Ideally the buffer would be completely filled with normal mic noise to prevent abnormal behavior introduced by zero padding. Just to sanity check, how long are the samples?
  2. It's also possible the sample alignment is affecting the MFCC features and either the model is just not very robust or the audio window function for the MFCC features is non-optimal and causing a big change in the inputs.
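A quick way to see how the second hypothesis could arise: if the MFCC hop length doesn't evenly divide the 1024-sample shift, every analysis window in the shifted clip samples the waveform at a different offset, so the feature inputs genuinely differ rather than just shifting by whole frames. A sketch with illustrative window/hop values (the actual values live in the model's .params file):

```python
sample_rate = 16000               # illustrative; see the model's .params file
window = int(0.1 * sample_rate)   # 1600 samples per MFCC analysis window
hop = int(0.05 * sample_rate)     # 800 samples between window starts

shift = 1024                      # one extra chunk of zero-padding
# Because 1024 is not a multiple of the hop, every window in the shifted
# clip starts at a different offset within the waveform
print(shift % hop)                # 224
```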

Looks like for this model buffer_t is set to 1.5 seconds. The input audio clip is just over 4 seconds, with about ~1.5 seconds of background mic noise before the wake word.
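For reference, the .params sidecar is a JSON file saved next to the model; a sketch of reading buffer_t from it (the path here is a temp-file stand-in, and the exact key set may vary by Precise version):

```python
import json
import os
import tempfile

# Stand-in .params content with illustrative values
params_json = '{"buffer_t": 1.5, "window_t": 0.1, "hop_t": 0.05, "sample_rate": 16000}'
path = os.path.join(tempfile.mkdtemp(), "hey-mycroft.pb.params")
with open(path, "w") as f:
    f.write(params_json)

# Read the sidecar the same way you would for a real model file
with open(path) as f:
    params = json.load(f)
print(params["buffer_t"])  # 1.5
```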

That's a great point about zero-padding potentially causing an issue with the MFCC features. Here are some plots where I duplicate the clip's initial background mic noise for ~1 second as padding instead of zeros (so now there is ~2.5 seconds of background noise before the wake word):
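A sketch of how the noise-padded versions were constructed, using synthetic data as a stand-in for the attached wav (only the construction matters here):

```python
import numpy as np

sr = 16000
rng = np.random.default_rng(0)
dat = rng.integers(-300, 300, sr * 4, dtype=np.int16)  # stand-in for the clip

# Prepend ~1 s of the clip's own leading mic noise instead of zeros
clip1 = np.concatenate((dat[:sr], dat))
# Clip 2: one extra 1024-sample chunk of that same background noise
clip2 = np.concatenate((dat[:sr + 1024], dat))
print(clip2.shape[0] - clip1.shape[0])  # 1024
```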

Clip 1

image

Clip 2

image

Again, the only difference between clip 1 and clip 2 is 1024 more samples of background-noise padding in clip 2; the actual wake word utterance is identical. There still seems to be a significant difference between the two, in both the maximum score and the overall trend of the frame scores over time.