marianne-m/brouhaha-vad

SNR and C50 detailed arrays have an unexpected length

Closed this issue · 5 comments

Hi, thanks for your work. I ran Brouhaha on a file of length 3:37:57.326, i.e. 13077.326 seconds. I examined the c50 and detailed_snr_labels .npy files, and their shape was (756644,). I expected 756644 * 16 / 1000 to equal the length of the clip (16 ms per frame, as per the paper), but that is not the case.

The ratio between the length of the audio file and the length of the arrays came out to 17.28 ms per frame. I manually verified this by graphing the SNR and seeing that it lines up with speech starting and ending only when I used 17.28/1000 as the conversion factor from frames to seconds. Where does this number come from? It doesn't correspond to a whole number of samples at 16 kHz (it's around 276.5 samples per frame, though maybe padding can explain the .5?)

An interesting side-note is that the .rttm file has correct timings, so it's not that everything is off.
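For reference, the arithmetic behind the report above can be reproduced like this (values are taken from this report; the array length would normally come from `np.load("detailed_snr_labels.npy").shape`):

```python
# Hypothetical reproduction of the frame-duration check in this report.
audio_duration_s = 13077.326   # 3:37:57.326 of audio
n_frames = 756644              # shape of detailed_snr_labels.npy

frame_duration_ms = audio_duration_s / n_frames * 1000
samples_per_frame = frame_duration_ms / 1000 * 16000  # assuming 16 kHz

print(f"{frame_duration_ms:.2f} ms/frame, {samples_per_frame:.1f} samples/frame")
```

which gives roughly 17.28 ms per frame and 276.5 samples per frame, instead of the expected 16 ms / 256 samples.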

  1. Same issue here. In my case the frame duration was 16.86 ms (269.82 samples/frame). (The input audio is 16 kHz and its length is 30.304 s, but len(c50)*16/1000 is 28.752 s.)

  2. Another question: what kinds of speakers can the model identify? My data contains multiple speakers, mostly mothers and children, and the model only recognized the mothers' vocals, not the children's.

  3. The SNR output contains negative values, but shouldn't the range be between 0 and 30 dB, as written in the paper?

Hi to both of you,

  1. The frame duration is precisely 16.875 ms (270 samples per frame) and is determined by the stride parameter in the SincNet architecture. I realize the paper was a bit misleading, and I will fix it by stating the exact frame duration!
    @shenberg I'm not sure why you end up with a frame duration of 17.3 ms. I ran Brouhaha on a 16-hour-long audio file and got the right frame duration. Are you sure your audio file is sampled at 16 kHz? Could you send it to me so that I can try to reproduce this behavior?

  2. I wouldn't use this model to detect child speech as it has only been trained on adult speakers. For child speech detection (but no SNR/C50 estimation), you could maybe try this github repo. Maybe using the information returned by both models would be enough for your use case (SNR/C50 from Brouhaha and CHI from vtc).

  3. Audio segments are indeed augmented with additive background noise using a [0,30] dB range to create the training set. However, when we recompute the SNR at the frame level, it can take values outside of this range (including negative values).
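Given the 270-sample stride mentioned in point 1, a frame-to-time conversion can be sketched as follows (assuming 16 kHz audio; the function name is just for illustration):

```python
SAMPLE_RATE = 16000
STRIDE_SAMPLES = 270  # SincNet stride, per the answer above

def frame_to_seconds(frame_index: int) -> float:
    """Start time (in seconds) of a given output frame."""
    return frame_index * STRIDE_SAMPLES / SAMPLE_RATE

frame_duration_ms = STRIDE_SAMPLES / SAMPLE_RATE * 1000  # 16.875 ms
```

So frame 1000, for instance, starts at 16.875 s, not at 16.0 s as the 16 ms approximation would suggest.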

From the paper:
[screenshot: table of output-range parameters from the paper]

These are the parameters that determine the output range. For the SNR, it is bounded between -15 and 80 dB.

Best,
Marvin

Hi, thanks for the response, and thanks even more for the paper and model!

I did try to find the stride of SincNet, but it isn't mentioned directly in the SincNet paper and wasn't trivial to figure out from the code :)

Regarding the file, I unfortunately can't share it as it doesn't belong to me. I'm certain it's sampled at exactly 16,000Hz and the lengths and ratios I reported are accurate. See this notebook snippet:
[notebook screenshot]
(I also opened the file in Audacity to make sure the length in samples and the sample rate are reported correctly.)

Thanks again for the response. I'll try to replicate this with a file I can share.

Awesome! Please let me know of your progress! I feel like the reason behind this strange behavior is obvious, but I can't see it yet!

Closing for now! Feel free to re-open if you're still struggling to figure out what the model returns :)